ABSTRACT

We show that it is possible to significantly improve the accu- 
racy of a general class of histogram queries while satisfying 
differential privacy. Our approach carefully chooses a set 
of queries to evaluate, and then exploits consistency con- 
straints that should hold over the noisy output. In a post- 
processing phase, we compute the consistent input most 
likely to have produced the noisy output. The final out- 
put is differentially-private and consistent, but in addition, 
it is often much more accurate. We show, both theoreti- 
cally and experimentally, that these techniques can be used 
for estimating the degree sequence of a graph very precisely, 
and for computing a histogram that can support arbitrary 
range queries accurately. 

1. INTRODUCTION 

Recent work in differential privacy [9] has shown that it is 
possible to analyze sensitive data while ensuring strong pri- 
vacy guarantees. Differential privacy is typically achieved 
through random perturbation: the analyst issues a query 
and receives a noisy answer. To ensure privacy, the noise 
is carefully calibrated to the sensitivity of the query. Infor- 
mally, query sensitivity measures how much a small change 
to the database — such as adding or removing a person's pri- 
vate record — can affect the query answer. Such query mech- 
anisms are simple, efficient, and often quite accurate. In 
fact, one mechanism has recently been shown to be optimal 
for a single counting query [10] — i.e., there is no better noisy 
answer to return under the desired privacy objective. 

However, analysts typically need to compute multiple statistics on a
database. Differentially private algorithms extend nicely to a set of
queries, but there can be difficult trade-offs among alternative
strategies for answering a workload of queries. Consider the analyst of
a private student database who requires answers to the following
queries: the total number of students, $x_t$; the numbers of students
$x_A$, $x_B$, $x_C$, $x_D$, $x_F$ receiving grades A, B, C, D, and F,
respectively; and the number of passing students, $x_p$ (grade D or
higher).

Figure 1: Our approach to querying private data.

Using a differentially private interface, a first alternative is to
request noisy answers for just $(x_A, x_B, x_C, x_D, x_F)$ and use those
answers to compute answers for $x_t$ and $x_p$ by summation. The
sensitivity of this set of queries is 1 because adding or removing one
tuple changes exactly one of the five outputs by a value of one.
Therefore, the noise added to individual answers is low and the noisy
answers are accurate estimates of the truth. Unfortunately, the noise
accumulates under summation, so the estimates for $x_t$ and $x_p$ are
worse.

A second alternative is to request noisy answers for all queries
$(x_t, x_p, x_A, x_B, x_C, x_D, x_F)$. This query set has sensitivity 3
(one change could affect three return values, each by a value of one),
and the privacy mechanism must add more noise to each component. This
means the estimates for $x_A, x_B, x_C, x_D, x_F$ are worse than above,
but the estimates for $x_t$ and $x_p$ may be more accurate. There is
another concern, however: inconsistency. The noisy answers are likely to
violate the following constraints, which one would naturally expect to
hold: $x_t = x_p + x_F$ and $x_p = x_A + x_B + x_C + x_D$. This means
the analyst must find a way to reconcile the fact that there are two
different estimates for the total number of students and two different
estimates for the number of passing students. We propose a technique
for resolving inconsistency in a set of noisy answers, and show that
doing so can actually increase accuracy. As a result, we show that
strategies inspired by the second alternative can be superior in many
cases.



Overview of Approach. Our approach, shown pictorially 
in Figure 1, involves three steps. 

First, given a task (such as computing a histogram over student
grades), the analyst chooses a set of queries $Q$ to send to the data
owner. The choice of queries will depend on the particular task, but in
this work they are chosen so that constraints hold among the answers.
For example, rather than issue $(x_A, x_B, x_C, x_D, x_F)$, the analyst
would formulate the query as $(x_t, x_p, x_A, x_B, x_C, x_D, x_F)$,
which has consistency constraints. The query set $Q$ is sent to the
data owner.

$L = \langle C_{000}, C_{001}, C_{010}, C_{011} \rangle$, $L(I) = \langle 2, 0, 10, 2 \rangle$
$H = \langle C_{0**}, C_{00*}, C_{01*}, C_{000}, C_{001}, C_{010}, C_{011} \rangle$, $H(I) = \langle 14, 2, 12, 2, 0, 10, 2 \rangle$
$S = sort(L)$, $S(I) = \langle 0, 2, 2, 10 \rangle$

Figure 2: (a) Illustration of sample data representing a bipartite graph of network connections; (b) Definitions
and sample values for alternative query sequences: L counts the number of connections for each source, H
provides a hierarchy of range counts, and S returns an ordered degree sequence for the implied graph.

In the second step, the data owner answers the set of queries, using a
standard differentially private mechanism [9], as follows. The queries
are evaluated on the private database and the true answer $Q(I)$ is
computed. Then random independent noise is added to each answer in the
set, where the data owner scales the noise based on the sensitivity of
the query set. The set of noisy answers $\tilde{q}$ is sent to the
analyst. Importantly, because this step is unchanged from [9], it
offers the same differential privacy guarantee.

The above step ensures privacy, but the set of noisy answers returned
may be inconsistent. In the third and final step, the analyst
post-processes the set of noisy answers to resolve inconsistencies
among them. We propose a novel approach for resolving inconsistencies,
called constrained inference, that finds a new set of answers
$\bar{q}$ that is the "closest" set to $\tilde{q}$ that also satisfies
the consistency constraints.

For two histogram tasks, our main technical contributions are efficient
techniques for the third step and a theoretical and empirical analysis
of the accuracy of $\bar{q}$. The surprising finding is that $\bar{q}$
can be significantly more accurate than $\tilde{q}$.

We emphasize that the constrained inference step has no impact on the
differential privacy guarantee. The analyst performs this step without
access to the private data, using only the constraints and the noisy
answers, $\tilde{q}$. The noisy answers $\tilde{q}$ are the output of a
differentially private mechanism; any post-processing of the answers
cannot diminish this rigorous privacy guarantee. The constraints are
properties of the query, not the database, and therefore known by the
analyst a priori. For example, the constraint
$x_p = x_A + x_B + x_C + x_D$ is simply the definition of $x_p$.

Intuitively, however, it would seem that if noise is added 
for privacy and then constrained inference reduces the noise, 
some privacy has been lost. In fact, our results show that 
existing techniques add more noise than is strictly necessary 
to ensure differential privacy. The extra noise provides no 
quantifiable gain in privacy but does have a significant cost 
in accuracy. We show that constrained inference can be an 
effective strategy for boosting accuracy. 

The increase in accuracy we achieve depends on the input 
database and the privacy parameters. For instance, for some 
databases and levels of noise the perturbation may tend to 



produce answers that do not violate the constraints. In this 
case the inference step would not improve accuracy. But we 
show that our inference process never reduces accuracy and 
give conditions under which it will boost accuracy. In prac- 
tice, we find that many real datasets have data distributions 
for which our techniques significantly improve accuracy. 

Histogram tasks. We demonstrate this technique on two specific tasks
related to histograms. For relational schema $R(A, B, \ldots)$, we
choose one attribute $A$ on which histograms are built (called the
range attribute). We assume the domain of $A$, dom, is ordered.

We explain these tasks using sample data that will serve 
as a running example throughout the paper, and is also the 
basis of later experiments. The relation R(src, dst), shown
in Fig. 2, represents a trace of network communications be- 
tween a source IP address (src) and a destination IP address 
(dst). It is bipartite because it represents flows through a 
gateway router from internal to external addresses. 

In a conventional histogram, we form disjoint intervals for 
the range attribute and compute counting queries for each 
specified range. In our example, we use src as the range at- 
tribute. There are four source addresses present in the table. 
If we ask for counts of all unit-length ranges, then the his- 
togram is simply the sequence (2, 0, 10, 2) corresponding to 
the (out) degrees of the source addresses (000, 001, 010, 011).

Our first histogram task is an unattributed histogram, 
in which the intervals themselves are irrelevant to the anal- 
ysis and so we report only a multiset of frequencies. For 
the example histogram, the multiset is {0,2,2, 10}. An im- 
portant instance of an unattributed histogram is the de- 
gree sequence of a graph, a crucial measure that is widely 
studied [18]. If the tuples of R represent queries submit- 
ted to a search engine, and A is the search term, then an 
unattributed histogram shows the frequency of occurrence 
of all terms (but not the terms themselves), and can be used 
to study the distribution. 

For our second histogram task, we consider more con- 
ventional sequences of counting queries in which the inter- 
vals studied may be irregular and overlapping. In this case, 
simply returning unattributed counts is insufficient. And 
because we cannot predict ahead of time all the ranges of 
interest, our goal is to compute privately a set of statistics 
sufficient to support arbitrary interval counts and thus any 
histogram. We call this a universal histogram. 
Continuing the example, a universal histogram allows the analyst to
count the number of packets sent from any single address (e.g., the
count from source address 010 is 10), or from any range of addresses
(e.g., the total number of packets is 14, and the number of packets
from a source address matching prefix 01* is 12).

While a universal histogram can be used to compute an unattributed
histogram, we distinguish between the two because we show that the
unattributed histogram can be computed much more accurately.

Contributions. For both unattributed and universal his- 
tograms, we propose a strategy for boosting the accuracy 
of existing differentially private algorithms. For each task, 
(1) we show that there is an efficiently-computable, closed- 
form expression for the consistent query answer closest to 
a private randomized output; (2) we prove bounds on the 
error of the inferred output, showing under what conditions 
inference boosts accuracy; (3) we demonstrate significant 
improvements in accuracy through experiments on real data 
sets. Unattributed histograms are extremely accurate, with 
error at least an order of magnitude lower than existing tech- 
niques. Our approach to universal histograms can reduce er- 
ror for larger ranges by 45-98%, and improves on all ranges 
in some cases. 

2. BACKGROUND 

In this section, we introduce the concept of query se- 
quences and how they can be used to support histograms. 
Then we review differential privacy and show how queries 
can be answered under differential privacy. Finally, we for- 
malize our constrained inference process. 

All of the tasks considered in this paper are formulated as query
sequences where each element of the sequence is a simple count query on
a range. We write intervals as $[x, y]$ for $x, y \in dom$, and
abbreviate interval $[x, x]$ as $[x]$. A counting query on range
attribute $A$ is:

c([x,y]) = Select count(*) From R Where x <= R.A <= y

We use $Q$ to denote a generic query sequence (please see Appendix A
for an overview of notational conventions). When $Q$ is evaluated on a
database instance $I$, the output, $Q(I)$, includes one answer to each
counting query, so $Q(I)$ is a vector of non-negative integers. The
$i$-th query in $Q$ is $Q[i]$.

We consider the common case of a histogram over unit- 
length ranges. The conventional strategy is to simply com- 
pute counts for all unit-length ranges. This query sequence 
is denoted L: 

$L = \langle c([x_1]), \ldots, c([x_n]) \rangle, \quad x_i \in dom$

Example 1. Using the example in Fig. 2, we assume the domain of src
contains just the 4 addresses shown. Query $L$ is
$\langle c([000]), c([001]), c([010]), c([011]) \rangle$ and
$L(I) = \langle 2, 0, 10, 2 \rangle$.
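
As a minimal sketch of the counting query and the unit-length sequence L, the Python below evaluates both on the running example. The function names `count_range` and `unit_counts`, and the encoding of the four source addresses 000 to 011 as the integers 0 to 3, are illustrative assumptions and not part of the paper.

```python
def count_range(values, x, y):
    """c([x, y]): number of records whose range attribute lies in [x, y]."""
    return sum(1 for a in values if x <= a <= y)


def unit_counts(values, domain):
    """The sequence L: one unit-length count c([x_i]) per domain element."""
    return [count_range(values, x, x) for x in domain]


# src column of the running example: 2 records from 000, 10 from 010,
# 2 from 011 (addresses encoded as 0..3, an assumed encoding).
src = [0] * 2 + [2] * 10 + [3] * 2
L_of_I = unit_counts(src, range(4))
```

Here `L_of_I` reproduces the vector of Example 1, and `count_range(src, 2, 3)` gives the count for prefix 01*.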

2.1 Differential Privacy 

Informally, an algorithm is differentially private if it is insensitive
to small changes in the input. Formally, for any input database $I$,
let $nbrs(I)$ denote the set of neighboring databases, each differing
from $I$ by at most one record; i.e., if $I' \in nbrs(I)$, then
$|(I - I') \cup (I' - I)| = 1$.

Definition 2.1 ($\epsilon$-differential privacy). Algorithm $A$ is
$\epsilon$-differentially private if for all instances $I$, any
$I' \in nbrs(I)$, and any subset of outputs $S \subseteq Range(A)$, the
following holds:

$\Pr[A(I) \in S] \le \exp(\epsilon) \times \Pr[A(I') \in S]$

where the probability is taken over the randomness of $A$.

Differential privacy has been defined inconsistently in the literature.
The original concept, called $\epsilon$-indistinguishability [9],
defines neighboring databases using Hamming distance rather than
symmetric difference (i.e., $I'$ is obtained from $I$ by replacing a
tuple rather than adding/removing a tuple). The choice of definition
affects the calculation of query sensitivity. We use the above
definition (from Dwork [7]) but observe that our results also hold
under indistinguishability, due to the fact that
$\epsilon$-differential privacy (as defined above) implies
$2\epsilon$-indistinguishability.

To answer queries under differential privacy, we use the 
Laplace mechanism [9], which achieves differential privacy 
by adding noise to query answers, where the noise is sam- 
pled from the Laplace distribution. The magnitude of the 
noise depends on the query's sensitivity, defined as follows 
(adapting the definition to the query sequences considered 
in this paper). 

Definition 2.2 (Sensitivity). Let $Q$ be a sequence of counting
queries. The sensitivity of $Q$, denoted $S_Q$, is

$S_Q = \max_{I,\, I' \in nbrs(I)} \| Q(I) - Q(I') \|_1$

Throughout the paper, we use $\|X - Y\|_p$ to denote the $L_p$ distance
between vectors $X$ and $Y$.

Example 2. The sensitivity of query $L$ is 1. Database $I'$ can be
obtained from $I$ by adding or removing a single record, which affects
exactly one of the counts in $L$ by exactly 1.

Given query $Q$, the Laplace mechanism first computes the query answer
$Q(I)$ and then adds random noise independently to each answer. The
noise is drawn from a zero-mean Laplace distribution with scale
$\sigma$. As the following proposition shows, differential privacy is
achieved if the Laplace noise is scaled appropriately to the
sensitivity of $Q$ and the privacy parameter $\epsilon$.

Proposition 1 (Laplace mechanism [9]). Let $Q$ be a query sequence of
length $d$, and let $\langle Lap(\sigma) \rangle^d$ denote a $d$-length
vector of i.i.d. samples from a Laplace with scale $\sigma$. The
randomized algorithm $\tilde{Q}$ that takes as input database $I$ and
outputs the following vector is $\epsilon$-differentially private:

$\tilde{Q}(I) = Q(I) + \langle Lap(S_Q/\epsilon) \rangle^d$

We apply this technique to the query $L$. Since $L$ has sensitivity 1,
the following algorithm is $\epsilon$-differentially private:

$\tilde{L}(I) = L(I) + \langle Lap(1/\epsilon) \rangle^n$
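
The mechanism of Proposition 1 can be sketched in a few lines of Python using only the standard library. The names `laplace_sample` and `laplace_mechanism` are hypothetical, and the sampler relies on the standard fact that the difference of two i.i.d. exponential variables with mean b is Laplace with scale b. This is an illustration under those assumptions, not a hardened implementation; floating-point noise generation in particular is not treated carefully here.

```python
import random


def laplace_sample(scale, rng):
    """One draw from a zero-mean Laplace distribution with the given scale."""
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answers, sensitivity, epsilon, rng):
    """Add independent Lap(sensitivity / epsilon) noise to every answer."""
    scale = sensitivity / epsilon
    return [a + laplace_sample(scale, rng) for a in true_answers]


rng = random.Random(0)  # fixed seed only so the illustration is repeatable
noisy_L = laplace_mechanism([2, 0, 10, 2], sensitivity=1, epsilon=1.0, rng=rng)
```

The noise scale depends only on the sensitivity and the privacy parameter, never on the data, so the same call works for any query sequence once its sensitivity is known.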

We rely on Proposition 1 to ensure privacy for the query sequences we
propose in this paper. We emphasize that the proposition holds for any
query sequence $Q$, regardless of correlations or constraints among the
queries in $Q$. Such dependencies are accounted for in the calculation
of sensitivity. (For example, if the correlated sequence $Q$ consists
of the same query repeated $k$ times, then the sensitivity of $Q$ is
$k$ times the sensitivity of the query.)






We present the case where the analyst issues a single query sequence
$Q$. To support multiple query sequences, the protocol that computes an
$\epsilon_i$-differentially private response to the $i$-th sequence is
$(\sum_i \epsilon_i)$-differentially private.

To analyze the accuracy of the randomized query sequences proposed in
this paper we quantify their error. $\tilde{Q}$ can be considered an
estimator for the true value $Q(I)$. We use the common Mean Squared
Error as a measure of accuracy.

Definition 2.3 (Error). For a randomized query sequence $\tilde{Q}$
whose input is $Q(I)$, the $error(\tilde{Q})$ is
$\sum_i E(\tilde{Q}[i] - Q[i])^2$. Here $E$ is the expectation taken
over the possible randomness in generating $\tilde{Q}$.

For example, $error(\tilde{L}) = \sum_i E(\tilde{L}[i] - L[i])^2$, which
simplifies to
$n E[Lap(1/\epsilon)^2] = n Var(Lap(1/\epsilon)) = 2n/\epsilon^2$.

2.2 Constrained Inference 

While L can be used to support unattributed and univer- 
sal histograms under differential privacy, the main contribu- 
tion of this paper is the development of more accurate query 
strategies based on the idea of constrained inference. The 
specific strategies are described in the next sections. Here, 
we formulate the constrained inference problem. 

Given a query sequence $Q$, let $\gamma_Q$ denote a set of constraints
which must hold among the (true) answers. The constrained inference
process takes the randomized output of the query, denoted
$\tilde{q} = \tilde{Q}(I)$, and finds the sequence of query answers
$\bar{q}$ that is "closest" to $\tilde{q}$ and also satisfies the
constraints of $\gamma_Q$. Here closest is determined by $L_2$
distance, and the result is the minimum $L_2$ solution:

Definition 2.4 (Minimum $L_2$ solution). Let $Q$ be a query sequence
with constraints $\gamma_Q$. Given a noisy query sequence
$\tilde{q} = \tilde{Q}(I)$, a minimum $L_2$ solution, denoted
$\bar{q}$, is a vector $\bar{q}$ that satisfies the constraints
$\gamma_Q$ and at the same time minimizes $\|\tilde{q} - \bar{q}\|_2$.

We use $\bar{Q}$ to denote the two-step randomized process in which the
data owner first computes $\tilde{q} = \tilde{Q}(I)$ and then computes
the minimum $L_2$ solution from $\tilde{q}$ and $\gamma_Q$.
(Alternatively, the data owner can release $\tilde{q}$ and the latter
step can be done by the analyst.) Importantly, the inference step has
no impact on privacy, as stated below. (Proofs appear in the Appendix.)

Proposition 2. If $\tilde{Q}$ satisfies $\epsilon$-differential
privacy, then $\bar{Q}$ satisfies $\epsilon$-differential privacy.

3. UNATTRIBUTED HISTOGRAMS 

To support unattributed histograms, one could use the 
query sequence L. However, it contains "extra" information — 
the attribution of each count to a particular range — which 
is irrelevant for an unattributed histogram. Since the associ- 
ation between L[i] and i is not required, any permutation of 
the unit-length counts is a correct response for the unattr- 
ibuted histogram. We formulate an alternative query that 
asks for the counts of L in rank order. As we will show, 
ordering does not increase sensitivity, but it does introduce 
inequality constraints that can be exploited by inference. 

Formally, let $a_i$ refer to the answer to $L[i]$ and let
$U = \{a_1, \ldots, a_n\}$ be the multiset of answers to queries in
$L$. We write $rank_i(U)$ to refer to the $i$-th smallest answer in
$U$. Then the query $S$ is defined as

$S = \langle rank_1(U), \ldots, rank_n(U) \rangle$



Example 3. In the example in Fig. 2, we have
$L(I) = \langle 2, 0, 10, 2 \rangle$ while
$S(I) = \langle 0, 2, 2, 10 \rangle$. Thus, the answer $S(I)$ contains
the same counts as $L(I)$ but in sorted order.

To answer S under differential privacy, we must determine 
its sensitivity. 

Proposition 3. The sensitivity of S is 1. 

By Propositions 1 and 3, the following algorithm is
$\epsilon$-differentially private:

$\tilde{S}(I) = S(I) + \langle Lap(1/\epsilon) \rangle^n$

Since the same magnitude of noise is added to $S$ as to $L$, the
accuracy of $\tilde{S}$ and $\tilde{L}$ is the same. However, $S$
implies a powerful set of constraints. Notice that the ordering occurs
before noise is added. Thus, the analyst knows that the returned counts
are ordered according to the true rank order. If the returned answer
contains out-of-order counts, this must be caused by the addition of
random noise, and they are inconsistent. Let $\gamma_S$ denote the set
of inequalities $S[i] \le S[i+1]$ for $1 \le i < n$. We show next how
to exploit these constraints to boost accuracy.

3.1 Constrained Inference: Computing $\bar{s}$

As outlined in the introduction, the analyst sends query $S$ to the
data owner and receives a noisy answer $\tilde{s} = \tilde{S}(I)$, the
output of the differentially private algorithm $\tilde{S}$ evaluated on
the private database $I$. We now describe a technique for
post-processing $\tilde{s}$ to find an answer that is consistent with
the ordering constraints.

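Formally, given $\tilde{s}$, the objective is to find an $\bar{s}$ that
minimizes $\|\tilde{s} - \bar{s}\|_2$ subject to the constraints
$\bar{s}[i] \le \bar{s}[i+1]$ for $1 \le i < n$. The solution has a
surprisingly elegant closed form. Let $\tilde{s}[i,j]$ be the
subsequence of $j - i + 1$ elements:
$\langle \tilde{s}[i], \tilde{s}[i+1], \ldots, \tilde{s}[j] \rangle$.
Let $\tilde{M}[i,j]$ be the mean of these elements, i.e.,
$\tilde{M}[i,j] = \sum_{k=i}^{j} \tilde{s}[k] / (j - i + 1)$.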

Theorem 1. Denote
$L_k = \min_{j \in [k,n]} \max_{i \in [1,j]} \tilde{M}[i,j]$ and
$U_k = \max_{i \in [1,k]} \min_{j \in [i,n]} \tilde{M}[i,j]$. The
minimum $L_2$ solution $\bar{s}$ is unique and given by:
$\bar{s}[k] = L_k = U_k$.

Since we first stated this result in a technical report [13], 
we have learned that this problem is an instance of isotonic 
regression (i.e., least squares regression under ordering con- 
straints on the estimands). The statistics literature gives 
several characterizations of the solution, including the above 
min-max formulas (cf. Barlow et al. [3]), as well as linear 
time algorithms for computing it (cf. Barlow et al. [2]). 

Example 4. We give three examples of $\tilde{s}$ and its closest
ordered sequence $\bar{s}$. First, suppose
$\tilde{s} = \langle 9, 10, 14 \rangle$. Since $\tilde{s}$ is already
ordered, $\bar{s} = \tilde{s}$. Second,
$\tilde{s} = \langle 9, 14, 10 \rangle$, where the last two elements
are out of order. The closest ordered sequence is
$\bar{s} = \langle 9, 12, 12 \rangle$. Finally, let
$\tilde{s} = \langle 14, 9, 10, 15 \rangle$. The sequence is in order
except for $\tilde{s}[1]$. While changing the first element from 14 to
9 would make it ordered, its squared distance from $\tilde{s}$ would be
$(14 - 9)^2 = 25$. In contrast,
$\bar{s} = \langle 11, 11, 11, 15 \rangle$ and
$\|\tilde{s} - \bar{s}\|_2^2 = 14$.
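
The three cases of Example 4 can be reproduced with a short sketch. Below, `isotonic_pava` is a standard pool-adjacent-violators routine (one of the linear-time approaches the isotonic regression literature describes, not necessarily the authors' implementation), and `isotonic_minmax` is a direct, cubic-time evaluation of the formula for L_k in Theorem 1, included only as a cross-check. Both function names are assumptions made for this illustration.

```python
def isotonic_pava(s):
    """Closest non-decreasing sequence to s in L2 (pool adjacent violators)."""
    blocks = []  # each block is [sum, count]; block means stay non-decreasing
    for value in s:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean
        # (compared by cross-multiplication to avoid dividing).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    result = []
    for total, count in blocks:
        result.extend([total / count] * count)
    return result


def isotonic_minmax(s):
    """Evaluate L_k = min over j >= k of max over i <= j of mean(s[i..j])."""
    n = len(s)
    prefix = [0.0]
    for value in s:
        prefix.append(prefix[-1] + value)

    def mean(i, j):  # 0-based, inclusive bounds
        return (prefix[j + 1] - prefix[i]) / (j - i + 1)

    return [min(max(mean(i, j) for i in range(j + 1)) for j in range(k, n))
            for k in range(n)]


s_noisy = [14, 9, 10, 15]
s_bar = isotonic_pava(s_noisy)
squared_distance = sum((a - b) ** 2 for a, b in zip(s_noisy, s_bar))
```

On the last sequence of Example 4 both routines pool the first three values into their mean, and `squared_distance` matches the value stated in the example.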

3.2 Utility Analysis: the Accuracy of $\bar{s}$

Prior work in isotonic regression has shown inference cannot hurt,
i.e., the accuracy of $\bar{S}$ is no lower than that of $\tilde{S}$
[14]. However, we are not aware of any results that give conditions for
which $\bar{S}$ is more accurate than $\tilde{S}$. Before presenting a
theoretical statement of such conditions, we first give an illustrative
example.

Figure 3: Example of how $\bar{s}$ reduces the error of $\tilde{s}$.

Example 5. Figure 3 shows a sequence $S(I)$ along with a sampled
$\tilde{s}$ and inferred $\bar{s}$. While the values in $\tilde{s}$
deviate considerably from $S(I)$, $\bar{s}$ lies very close to the true
answer. In particular, for subsequence $[1, 20]$, the true sequence
$S(I)$ is uniform and the constrained inference process effectively
averages out the noise of $\tilde{s}$. At the twenty-first position,
which is a unique count in $S(I)$, constrained inference does not
refine the noisy answer, i.e., $\bar{s}[21] = \tilde{s}[21]$.

Fig. 3 suggests that $error(\bar{S})$ will be low for sequences in
which many counts are the same (Fig. 7 in Appendix C gives another
intuitive view of the error reduction). The following theorem
quantifies the accuracy of $\bar{S}$ precisely. Let $n$ and $d$ denote
the number of values and the number of distinct values in $S(I)$,
respectively. Let $n_1, n_2, \ldots, n_d$ be the number of times each
of the $d$ distinct values occurs in $S(I)$ (thus $\sum_i n_i = n$).

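Theorem 2. There exist constants $c_1$ and $c_2$ independent of $n$ and
$d$ such that

$error(\bar{S}) \le \sum_{i=1}^{d} \frac{c_1 \log^3 n_i + c_2}{\epsilon^2}$

Thus $error(\bar{S}) = O(d \log^3 n / \epsilon^2)$ whereas
$error(\tilde{S}) = \Theta(n / \epsilon^2)$.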

The above theorem shows that constrained inference can boost accuracy,
and the improvement depends on properties of the input $S(I)$. In
particular, if the number of distinct elements $d$ is 1, then
$error(\bar{S}) = O(\log^3 n / \epsilon^2)$, while
$error(\tilde{S}) = \Theta(n / \epsilon^2)$. On the other hand, if
$d = n$, then $error(\bar{S}) = O(n / \epsilon^2)$ and thus both
$error(\bar{S})$ and $error(\tilde{S})$ scale linearly in $n$. For many
practical applications, $d \ll n$, which makes $error(\bar{S})$
significantly lower than $error(\tilde{S})$. In Sec. 5, experiments on
real data demonstrate that the error of $\bar{S}$ can be orders of
magnitude lower than that of $\tilde{S}$.

4. UNIVERSAL HISTOGRAMS 

While the query sequence L is the conventional strategy 
for computing a universal histogram, this strategy has lim- 
ited utility under differential privacy. While accurate for 
small ranges, the noise in the unit-length counts accumu- 
lates under summation, so for larger ranges, the estimates 
can easily become useless. 



We propose an alternative query sequence that, in addition
to asking for unit-length intervals, asks for interval
counts of larger granularity. To ensure privacy, slightly more 
noise must be added to the counts. However, this strategy 
has the property that any range query can be answered via 
a linear combination of only a small number of noisy counts, 
and this makes it much more accurate for larger ranges. 

Our alternative query sequence, denoted $H$, consists of a sequence of
hierarchical intervals. Conceptually, these intervals are arranged in a
tree $T$. Each node $v \in T$ corresponds to an interval, and each node
has $k$ children, corresponding to $k$ equally sized subintervals. The
root of the tree is the interval $[x_1, x_n]$, which is recursively
divided into subintervals until, at the leaves of the tree, the
intervals are unit-length, $[x_1], [x_2], \ldots, [x_n]$. For
notational convenience, we define the height of the tree $\ell$ as the
number of nodes, rather than edges, along the path from a leaf to the
root. Thus, $\ell = \log_k n + 1$. To transform the tree into a
sequence, we arrange the interval counts in the order given by a
breadth-first traversal of the tree.




Figure 4: The tree $T$ associated with query $H$ for
the example in Fig. 2 for $k = 2$.

Example 6. Continuing from the example in Fig. 2, we describe $H$ for
the src domain. The intervals are arranged into a binary ($k = 2$)
tree, as shown in Fig. 4. The root is associated with the interval
$[0**]$, which is evenly subdivided among its children. The unit-length
intervals at the leaves are $[000], [001], [010], [011]$. The height of
the tree is $\ell = 3$.

The intervals of the tree are arranged into the query sequence
$H = \langle C_{0**}, C_{00*}, C_{01*}, C_{000}, C_{001}, C_{010}, C_{011} \rangle$.
Evaluated on instance $I$ from Fig. 2, the answer is
$H(I) = \langle 14, 2, 12, 2, 0, 10, 2 \rangle$.
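
To make the breadth-first layout concrete, the sketch below builds the true answer H(I) from the unit-length counts L(I) for a complete k-ary tree. The function name `hierarchical_counts` is hypothetical, and the sketch assumes the number of leaves is an exact power of k.

```python
def hierarchical_counts(unit_counts, k=2):
    """Return H in breadth-first order for a complete k-ary tree over the leaves."""
    levels = [list(unit_counts)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        if len(below) % k != 0:
            raise ValueError("number of leaves must be a power of k")
        # Each parent count is the sum of its k children.
        levels.append([sum(below[i:i + k]) for i in range(0, len(below), k)])
    # Breadth-first order: root level first, leaves last.
    return [count for level in reversed(levels) for count in level]


H_of_I = hierarchical_counts([2, 0, 10, 2], k=2)
```

With this layout the children of the node at position v (counting from 0) sit at positions k*v + 1 through k*v + k, which is convenient for the tree passes used during inference.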

To answer H under differential privacy, we must determine 
its sensitivity. As the following proposition shows, H has a 
larger sensitivity than L. 

Proposition 4. The sensitivity of $H$ is $\ell$.
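
One way to see this: under the add/remove notion of neighboring databases, a single record with range value a changes exactly those counts whose interval contains a, each by one. For a sequence of range counts, the sensitivity is therefore the largest number of intervals covering any one domain value. The brute-force check below illustrates this on the running example; the function name and the (lo, hi) tuple encoding of intervals are assumptions of the sketch.

```python
def range_sequence_sensitivity(intervals, domain):
    """Max, over domain values, of how many query intervals contain the value."""
    return max(sum(1 for lo, hi in intervals if lo <= a <= hi) for a in domain)


domain = range(4)  # addresses 000..011 encoded as 0..3
L_intervals = [(a, a) for a in domain]
H_intervals = [(0, 3), (0, 1), (2, 3)] + L_intervals

sens_L = range_sequence_sensitivity(L_intervals, domain)
sens_H = range_sequence_sensitivity(H_intervals, domain)
```

Each leaf lies on exactly one root-to-leaf path, so every domain value is covered by one interval per level, which gives the tree height (3 for the example).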

By Propositions 1 and 4, the following algorithm is
$\epsilon$-differentially private:

$\tilde{H}(I) = H(I) + \langle Lap(\ell/\epsilon) \rangle^m$

where $m$ is the length of sequence $H$, equal to the number of counts
in the tree.

To answer a range query using H, a natural strategy is to 
sum the fewest number of sub-intervals such that their union 
equals the desired range. However, one challenge with this 
approach is inconsistency: in the corresponding tree of noisy 
answers, there may be a parent count that does not equal
the sum of its children. This can be problematic: for
example, an analyst might ask one interval query and then 
ask for a sub-interval and receive a larger count. 

We next look at how to use the arithmetic constraints between parent
and child counts (denoted $\gamma_H$) to derive a consistent, and more
accurate, estimate $\bar{H}$.
4.1 Constrained Inference: Computing $\bar{h}$

The analyst receives $\tilde{h} = \tilde{H}(I)$, the noisy output from
the differentially private algorithm $\tilde{H}$. We now consider the
problem of finding the minimum $L_2$ solution: to find the $\bar{h}$
that minimizes $\|\tilde{h} - \bar{h}\|_2$ and also satisfies the
consistency constraints $\gamma_H$.

This problem can be viewed as an instance of linear regression. The
unknowns are the true counts of the unit-length intervals. Each answer
in $\tilde{h}$ is a fixed linear combination of the unknowns, plus
random noise. Finding $\bar{h}$ is equivalent to finding an estimate
for the unit-length intervals. In fact, $\bar{h}$ is the familiar least
squares solution.

Although the least squares solution can be computed via linear algebra,
the hierarchical structure of this problem instance allows us to derive
an intuitive closed-form solution that is also more efficient to
compute, requiring only linear time. Let $T$ be the tree corresponding
to $\tilde{h}$; abusing notation, for node $v \in T$, we write
$\tilde{h}[v]$ to refer to the interval associated with $v$.

First, we define a possibly inconsistent estimate $z[v]$ for each node
$v \in T$. The consistent estimate $\bar{h}[v]$ is then described in
terms of the $z[v]$ estimates. $z[v]$ is defined recursively from the
leaves to the root. Let $l$ denote the height of node $v$ and
$succ(v)$ denote the set of $v$'s children.

$z[v] = \tilde{h}[v]$, if $v$ is a leaf node
$z[v] = \frac{k^l - k^{l-1}}{k^l - 1} \tilde{h}[v] + \frac{k^{l-1} - 1}{k^l - 1} \sum_{u \in succ(v)} z[u]$, otherwise

The intuition behind $z[v]$ is that it is a weighted average of two
estimates for the count at $v$; in fact, the weights are inversely
proportional to the variance of the estimates.

The consistent estimate $\bar{h}$ is defined recursively from the root
to the leaves. At the root $r$, $\bar{h}[r]$ is simply $z[r]$. As we
descend the tree, if at some node $u$ we have
$\bar{h}[u] \ne \sum_{w \in succ(u)} z[w]$, then we adjust the values
of each descendant by dividing the difference
$\bar{h}[u] - \sum_{w \in succ(u)} z[w]$ equally among the $k$
descendants. The following theorem states that this is the minimum
$L_2$ solution.

Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the
unique minimum $L_2$ solution, $\bar{h}$, is given by the following
recurrence relation. Let $u$ be $v$'s parent:

$\bar{h}[v] = z[v]$, if $v$ is the root
$\bar{h}[v] = z[v] + \frac{1}{k} \left( \bar{h}[u] - \sum_{w \in succ(u)} z[w] \right)$, otherwise

Theorem 3 shows that the overhead of computing $\bar{H}$ is minimal,
requiring only two linear scans of the tree: a bottom-up scan to
compute $z$ and then a top-down scan to compute the solution $\bar{h}$
given $z$.
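
The two scans can be sketched as follows for a complete k-ary tree stored in breadth-first order, so that the children of position v are positions k*v + 1 through k*v + k. The bottom-up weights are the inverse-variance weights (k^l - k^(l-1)) / (k^l - 1) and (k^(l-1) - 1) / (k^l - 1) for a node of height l, with leaves at height 1. The function name `constrained_inference` and the array layout are assumptions of this sketch rather than the authors' released code.

```python
def constrained_inference(h_noisy, k=2):
    """Two-pass consistent estimate for a complete k-ary tree in BFS order."""
    m = len(h_noisy)
    if k < 2 or m == 0:
        raise ValueError("need k >= 2 and a non-empty sequence")
    # Check that m = 1 + k + ... + k^(levels-1) and find the number of levels.
    levels, width, total = 0, 1, 0
    while total < m:
        total += width
        width *= k
        levels += 1
    if total != m:
        raise ValueError("length does not match a complete k-ary tree")
    first_leaf = m - k ** (levels - 1)

    def children(v):
        return range(k * v + 1, k * v + k + 1)

    # Bottom-up scan: blend each node's noisy count with its children's z sum.
    z = [float(c) for c in h_noisy]
    height = [1] * m  # leaves have height 1
    for v in range(first_leaf - 1, -1, -1):
        l = height[k * v + 1] + 1
        height[v] = l
        own_weight = (k ** l - k ** (l - 1)) / (k ** l - 1)
        child_sum = sum(z[u] for u in children(v))
        z[v] = own_weight * h_noisy[v] + (1 - own_weight) * child_sum

    # Top-down scan: split each parent's mismatch equally among its k children.
    h_bar = z[:]  # the root keeps z[root]
    for u in range(first_leaf):
        diff = h_bar[u] - sum(z[w] for w in children(u))
        for w in children(u):
            h_bar[w] = z[w] + diff / k
    return h_bar


h_bar = constrained_inference([13.2, 3.1, 11.4, 1.5, -0.8, 10.9, 2.6])
```

By construction every parent in `h_bar` equals the sum of its children, and an input that is already consistent is returned unchanged up to floating-point rounding. For a three-node tree the result can be checked by hand against the least squares solution: the noisy input (10, 3, 4) has mismatch 3, and the estimate is (9, 4, 5).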

4.2 Utility Analysis: the Accuracy of $\bar{h}$

We measure utility as accuracy of range queries, and we compare three
strategies: $\tilde{L}$, $\tilde{H}$, and $\bar{H}$. We start by
comparing $\tilde{L}$ and $\tilde{H}$.

Given range query $q = c([x,y])$, we derive an estimate for the answer
as follows. For $\tilde{L}$, the estimate is simply the sum of the
noisy unit-length intervals in the range:
$\tilde{L}_q = \sum_{i=x}^{y} \tilde{L}[i]$. The error of each count is
$2/\epsilon^2$, and so the error for the range is
$error(\tilde{L}_q) = O((y - x)/\epsilon^2)$.

For $\tilde{H}$, we choose the natural strategy of summing the fewest
sub-intervals of $\tilde{H}$. Let $r_1, \ldots, r_t$ be the roots of
disjoint subtrees of $T$ such that the union of their ranges equals
$[x,y]$. Then $\tilde{H}_q$ is defined as
$\tilde{H}_q = \sum_{i=1}^{t} \tilde{H}[r_i]$. Each noisy count has
error equal to $2\ell^2/\epsilon^2$ (equal to the variance of the added
noise) and the number of subtrees is at most $2\ell$ (at most two per
level of the tree), thus
$error(\tilde{H}_q) = O(\ell^3/\epsilon^2)$.
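
The choice of subtree roots can be sketched as a recursive descent that stops as soon as a node's interval lies entirely inside the query range. The sketch assumes the breadth-first array layout of a complete k-ary tree over n leaves (n a power of k), with leaves numbered from 0; `range_nodes` is a hypothetical helper name.

```python
def range_nodes(lo, hi, n, k=2):
    """BFS positions of disjoint subtree roots whose leaf ranges tile [lo, hi]."""
    nodes = []

    def visit(v, start, width):
        end = start + width - 1
        if hi < start or end < lo:
            return  # no overlap with the query range
        if lo <= start and end <= hi:
            nodes.append(v)  # fully covered: use this count, do not descend
            return
        child_width = width // k
        for c in range(k):
            visit(k * v + 1 + c, start + c * child_width, child_width)

    visit(0, 0, n)
    return nodes


# True counts H(I) of the running example; a noisy vector is used the same way.
H_true = [14, 2, 12, 2, 0, 10, 2]
count_01x = sum(H_true[v] for v in range_nodes(2, 3, n=4))
```

For the prefix 01* the descent stops at a single internal node, whereas summing unit-length counts would add two noisy terms; the gap widens as the range grows.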

There is clearly a tradeoff between these two strategies. While
$\tilde{L}$ is accurate for small ranges, error grows linearly with the
size of the range. In contrast, the error of $\tilde{H}$ is
poly-logarithmic in the size of the domain (recall that
$\ell = \Theta(\log n)$). Thus, while $\tilde{H}$ is less accurate for
small ranges, it is much more accurate for large ranges. If the goal of
a universal histogram is to bound worst-case or total error for all
range queries, then $\tilde{H}$ is the preferred strategy.

We now compare $\tilde{H}$ to $\bar{H}$. Since $\bar{H}$ is consistent,
range queries can be easily computed by summing the unit-length counts.
In addition to being consistent, it is also more accurate. In fact, it
is in some sense optimal: among the class of strategies that (a)
produce unbiased estimates for range queries and (b) derive the
estimate from linear combinations of the counts in $\tilde{h}$, there
is no strategy with lower mean squared error than $\bar{H}$.

Theorem 4. (i) $\bar{H}$ is a linear unbiased estimator, (ii)
$error(\bar{H}_q) \le error(E_q)$ for all $q$ and for all linear
unbiased estimators $E$, (iii)
$error(\bar{H}_q) = O(\ell^3/\epsilon^2)$ for all $q$, and (iv) there
exists a query $q$ s.t.
$error(\bar{H}_q) \le \frac{3}{2(\ell-1)(k-1)-k} \, error(\tilde{H}_q)$.

Part (iv) of the theorem shows that $\bar{H}$ can be more accurate than
$\tilde{H}$ on some range queries. For example, in a height 16 binary
tree (the kind of tree used in the experiments) there is a query $q$
where $\bar{H}_q$ is more accurate than $\tilde{H}_q$ by a factor of
$\frac{2(\ell-1)(k-1)-k}{3} = 9.33$.

Furthermore, the fact that $\bar{H}$ is consistent can lead to
additional advantages when the domain is sparse. We propose
a simple extension to $\bar{H}$: after computing $\bar{h}$, if there
is a subtree rooted at $v$ such that $\bar{h}[v] < 0$, we simply set
the count at $v$ and all children of $v$ to be zero. This is a
heuristic strategy; incorporating non-negativity constraints
into inference is left for future work. Nevertheless, we show
in experiments that this small change can greatly reduce error
in sparse regions and can lead to $\bar{H}$ being more accurate
than $\tilde{L}$ even at small ranges.
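The sparsity heuristic can be sketched as a single top-down sweep. The sketch assumes a heap-ordered binary tree (index 0 unused, index 1 the root) and reads "all children of v" as the whole subtree below v; both are assumptions made for illustration, not details confirmed by the text.

```python
def zero_negative_subtrees(h):
    """Return a copy of h in which every subtree whose root count is
    negative has been set to zero (root and all nodes below it)."""
    out = list(h)
    zeroed = [False] * len(out)
    for v in range(1, len(out)):        # parents are visited before children
        if out[v] < 0 or zeroed[v // 2]:
            out[v] = 0
            zeroed[v] = True
    return out
```

Note that zeroing a subtree can leave its parent's count unequal to the sum of its children, which is consistent with the text describing this step as a heuristic rather than as part of the inference itself.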

5. EXPERIMENTS 

We evaluate our techniques on three real datasets (ex- 
plained in detail in Appendix C): NetTrace is derived from 
an IP-level network trace collected at a major university; 
Social Network is a graph derived from friendship relations 
in an online social network site; Search Logs is a dataset of 
search query logs over time from Jan. 1, 2004 to the present. 
Source code for the algorithms is available at the first au- 
thor's website. 

5.1 Unattributed Histograms 

The first set of experiments evaluates the accuracy of constrained
inference on unattributed histograms. We compare
$\bar{S}$ to the baseline approach $\tilde{S}$. Since $\tilde{s} = \tilde{S}(I)$ is likely
to be inconsistent (out-of-order, non-integral, and possibly
negative), we consider a second baseline technique, denoted
$\tilde{S}_r$, which enforces consistency by sorting $\tilde{s}$ and rounding
each count to the nearest non-negative integer.
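The sort-and-round baseline is simple enough to state as code. This is a minimal sketch; the function name is hypothetical, and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rounding rule used in the experiments.

```python
def sort_and_round(noisy_counts):
    """Baseline S_r: sort the noisy counts, then round each one to the
    nearest non-negative integer."""
    return [max(0, round(c)) for c in sorted(noisy_counts)]
```

The result is sorted, integral, and non-negative, but it uses the ordering constraint only to reorder values, not to average out noise.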

We evaluate the performance of these estimators on three
queries from the three datasets. On NetTrace, the query
returns the number of internal hosts to which each external
host is connected ($\approx$ 65K external hosts). On Social
Network, the query returns the degree sequence of the graph
($\approx$ 11K nodes). On Search Logs, the query returns the
search frequency over a 3-month period of the top 20K keywords;
position $i$ in the answer vector is the number of times
the $i$-th ranked keyword was searched.

Figure 5: Error across varying datasets and $\epsilon$. Each
triplet of bars represents the three estimators: $\tilde{S}$
(light gray), $\tilde{S}_r$ (gray), and $\bar{S}$ (black).

To evaluate the utility of an estimator, we measure its
squared error. Results report the average squared error over
50 random samples from the differentially-private mechanism
(each sample produces a new $\tilde{s}$). We also show results
for three settings of $\epsilon \in \{1.0, 0.1, 0.01\}$; smaller $\epsilon$ means
more privacy, hence more random noise.
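The evaluation loop can be sketched as follows. The sketch assumes Laplace noise of scale 1/epsilon on each sorted count (the sorted query has sensitivity 1) and takes "squared error" to mean the sum of squared differences over all positions; the helper names and the seed are illustrative only.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample: the difference of two independent
    exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def avg_squared_error(true_sorted, estimator, eps, trials=50, seed=0):
    """Average, over `trials` noisy samples, of the total squared error
    of estimator(noisy sample) against the true sorted counts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [c + laplace(1 / eps, rng) for c in true_sorted]
        estimate = estimator(noisy)
        total += sum((e - t) ** 2 for e, t in zip(estimate, true_sorted))
    return total / trials
```

Passing the identity function as `estimator` measures the noisy baseline itself, whose expected error is 2n/epsilon^2 for n counts.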

Fig 5 shows the results of the experiment. Each bar
represents average performance for a single combination of
dataset, $\epsilon$, and estimator. The bars represent, from left to
right, $\tilde{S}$ (light gray), $\tilde{S}_r$ (gray), and $\bar{S}$ (black). The vertical
axis is average squared error on a log-scale. The results indicate
that the proposed approach reduces the error by at
least an order of magnitude across all datasets and settings
of $\epsilon$. Also, the difference between $\tilde{S}_r$ and $\bar{S}$ suggests that
the improvement is due not simply to enforcing integrality
and non-negativity but to the way consistency is enforced
through constrained inference (though $\bar{S}$ and $\tilde{S}_r$ are comparable
on Social Network at large $\epsilon$). Finally, the relative
accuracy of $\bar{S}$ improves with decreasing $\epsilon$ (more noise).
Appendix C provides intuition for how $\bar{S}$ reduces error.

5.2 Universal Histograms 

We now evaluate the effectiveness of constrained inference
for the more general task of computing a universal histogram
and arbitrary range queries. We evaluate three techniques
for supporting universal histograms: $\tilde{L}$, $\tilde{H}$, and $\bar{H}$. For all
three approaches, we enforce integrality and non-negativity
by rounding to the nearest non-negative integer. With $\bar{H}$,
rounding is done as part of the inference process, using the
approach described in Sec 4.2.

We evaluate the accuracy over a set of range queries of
varying size and location. The range sizes are $2^i$ for
$i = 1, \ldots, \ell - 2$ where $\ell$ is the height of the tree. For each fixed
size, we select the location uniformly at random. We report
the average error over 50 random samples of $\tilde{l}$ and $\tilde{h}$, and
for each sample, 1000 randomly chosen ranges.

We evaluate the following histogram queries: On NetTrace,
the number of connections for each external host.
This is similar to the query in Sec 5.1 except that here the
association between IP address and count is retained. On
Search Logs, the query reports the temporal frequency of
the query term "Obama" from Jan. 1, 2004 to present. (A
day is evenly divided into 16 units of time.)

Figure 6: A comparison of estimators $\tilde{L}$ (circles), $\tilde{H}$ (diamonds),
and $\bar{H}$ (squares) on two real-world datasets
(top NetTrace, bottom Search Logs). The x-axis of each plot is the
range size (log scale).

Fig 6 shows the results for both datasets and varying $\epsilon$.
The top row corresponds to NetTrace, the bottom to Search
Logs. Within a row, each plot shows a different setting of
$\epsilon \in \{1.0, 0.1, 0.01\}$. For all plots, the x-axis is the size of
the range query, and the y-axis is the error, averaged over
sampled counts and intervals. Both axes are in log-scale.

First, we compare $\tilde{L}$ and $\tilde{H}$. For unit-length ranges, $\tilde{L}$
yields more accurate estimates. This is unsurprising since it
is a lower sensitivity query and thus less noise is added for
privacy. However, the error of $\tilde{L}$ increases linearly with the
size of the range. The average error of $\tilde{H}$ increases slowly
with the size of the range, as larger ranges typically require
summing a greater number of subtrees. For ranges larger
than about 2000 units, the error of $\tilde{L}$ is higher than $\tilde{H}$; for
the largest ranges, the error of $\tilde{L}$ is 4-8 times larger than
that of $\tilde{H}$ (note the log-scale).

Comparing $\bar{H}$ against $\tilde{H}$, the error of $\bar{H}$ is uniformly lower
across all range sizes, settings of $\epsilon$, and datasets. The relative
performance of the estimators depends on $\epsilon$. At smaller
$\epsilon$, the estimates of $\bar{H}$ are more accurate relative to $\tilde{H}$ and
$\tilde{L}$. Recall that as $\epsilon$ decreases, noise increases. This suggests
that the relative benefit of statistical inference increases with
the uncertainty in the observed data.

Finally, the results show that $\bar{H}$ can have lower error than
$\tilde{L}$ over small ranges, even for leaf counts. This may be surprising
since for unit-length counts, the scale of the noise of
$\bar{H}$ is larger than that of $\tilde{L}$ by a factor of $\log n$. The reduction
in error is because these histograms are sparse. When the
histogram contains sparse regions, $\bar{H}$ can effectively identify
them because it has noisy observations at higher levels of
the tree. In contrast, $\tilde{L}$ has only the leaf counts; thus, even
if a range contains no records, $\tilde{L}$ will assign a positive count
to roughly half of the leaves in the range.

6. RELATED WORK 

Dwork has written comprehensive reviews of differential 
privacy [7, 8] ; below we highlight results closest to this work. 
The idea of post-processing the output of a differentially
private mechanism to ensure consistency was introduced in
Barak et al. [1], who proposed a linear program for making
a set of marginals consistent, non-negative, and integral.
However, unlike the present work, the post-processing is not
shown to improve accuracy.

Blum et al. [4] propose an efficient algorithm to publish 
synthetic data that is useful for range queries. In comparison 
with our hierarchical histogram, the technique of Blum et al. 
scales slightly better (logarithmic versus poly-logarithmic) 
in terms of domain size (with all else fixed). However, our 
hierarchical histogram achieves lower error for a fixed do- 
main, and importantly, the error does not grow as the size 
of the database increases, whereas with Blum et al. it grows
as $O(N^{2/3})$, with $N$ being the number of records (details
in Appendix E).

The present work first appeared as an arXiv preprint [13], 
and since then a number of related works have emerged, 
including additional work by the authors. The technique 
for unattributed histograms has been applied to accurately 
and efficiently estimate the degree sequences of large social 
networks [12]. Several techniques for histograms over hierar- 
chical domains have been developed. Xiao et al. [22] propose 
an approach based on the Haar wavelet, which is conceptu- 
ally similar to the H query in that it is based on a tree of 
queries where each level in the tree is an increasingly fine- 
grained summary of the data. In fact, that technique has er- 
ror equivalent to a binary H query, as shown by Li et al. [15], 
who represent both techniques as applications of the matrix 
mechanism, a framework for computing workloads of linear 
counting queries under differential privacy. We are aware of 
ongoing work by McSherry et al. [17] that combines hierar- 
chical querying with statistical inference, but differs from H 
in that it is adaptive. Chan et al. [5] consider the problem of 
continual release of aggregate statistics over streaming pri- 
vate data, and propose a differentially private counter that 
is similar to H, in which items are hierarchically aggregated 
by arrival time. The H and wavelet strategies are specifically 
designed to support range queries. Strategies for answering 
more general workloads of queries under differential privacy 
are emerging, in both the offline [11, 15] and online [20] 
settings. 

7. CONCLUSIONS 

Our results show that by transforming a differentially- 
private output so that it is consistent, we can boost accu- 
racy. Part of the innovation is devising a query set so that 
useful constraints hold. Then the challenge is to apply the 
constraints by searching for the closest consistent solution. 
Our query strategies for histograms have closed-form solu- 
tions for efficiently computing a consistent answer. 

Our results show that conventional differential privacy ap- 
proaches can add more noise than is strictly required by the 
privacy condition. We believe that using constraints may 
be an important part of finding optimal strategies for query 
answering under differential privacy. More discussion of the 
implications of our results, and possible extensions, is in- 
cluded in Appendix B. 
APPENDIX

A. NOTATIONAL CONVENTIONS

The table below summarizes notational conventions used
in the paper.

Symbol                                                        Meaning
$Q$                                                           Sequence of counting queries
$L$                                                           Unit-Length query sequence
$H$                                                           Hierarchical query sequence
$S$                                                           Sorted query sequence
$\gamma_Q$                                                    Constraint set for query $Q$
$\tilde{Q}, \tilde{L}, \tilde{H}, \tilde{S}$                  Randomized query sequence
$\bar{H}, \bar{S}$                                            Randomized query sequence, returning minimum $L_2$ solution
$I$                                                           Private database instance
$L(I), H(I), S(I)$                                            Output sequence (truth)
$\tilde{l} = \tilde{L}(I), \tilde{h} = \tilde{H}(I), \tilde{s} = \tilde{S}(I)$   Output sequence (noisy)
$\bar{h} = \bar{H}(I), \bar{s} = \bar{S}(I)$                  Output sequence (inferred)



B. DISCUSSION OF MAIN RESULTS 

Here we provide a supplementary discussion of the results, 
review the insights gained, and discuss future directions. 

Unattributed histograms. The choice of the sorted query 
S, instead of L, is an unqualified benefit, because we gain 
from the inequality constraints on the output, while the sen- 
sitivity of S is no greater than that of L. Among other ap- 
plications, this allows for extremely accurate estimation of 
degree sequences of a graph, improving error by an order 
of magnitude on the baseline technique. The accuracy of 
the estimate depends on the input sequence. It works best 
for sequences with duplicate counts, which matches well the 
degree sequences of social networks encountered in practice. 

Future work specifically oriented towards degree sequence
estimation could include a constraint enforcing that the output
sequence is graphical, i.e., the degree sequence of some
graph.

Universal histograms. The choice of the hierarchical count- 
ing query H, instead of L, offers a trade off because the sen- 
sitivity of H is greater than that of L. It is interesting that 
for some data sets and privacy levels, the effect of the H con- 
straints outweighs the increased noise that must be added. 
In other cases, the algorithms based on H provide greater ac- 
curacy for all but the smallest ranges. We note that in many 
practical settings, domains are large and sparse. The spar- 
sity implies that no differentially private technique can yield 
meaningful answers for unit-length queries because the noise 
necessary for privacy will drown out the signal. So while L 
sometimes has higher accuracy for small range queries, this 
may not have practical relevance since the relative error of 
the answers renders them useless. 

In future work we hope to extend the technique for uni- 
versal histograms to multi-dimensional range queries, and to 
investigate optimizations such as higher branching factors. 



Across both histogram tasks, our results clearly show that 
it is possible to achieve greater accuracy without sacrificing 
privacy. The existence of our improved estimators $\bar{S}$ and $\bar{H}$
shows that there is another differentially private noise distribution
that is more accurate than independent Laplace
noise. This does not contradict existing results because the 
original differential privacy work showed only that calibrat- 
ing Laplace noise to the sensitivity of a query was sufficient 
for privacy, not that it was necessary. Only recently has 
the optimality of this construction been studied (and proven 
only for single queries) [10]. Finding the optimal strategy 
for answering a set of queries under differential privacy is 
an important direction for future work, especially in light of 
emerging private query interfaces [16]. 

A natural goal is to describe directly the improved noise
distributions implied by $\bar{S}$ and $\bar{H}$, and build a privacy mechanism
that samples from them. This could, in theory, avoid
the inference step altogether. But it seems quite difficult
to discover, describe, and sample these improved noise dis- 
tributions, which will be highly dependent on a particular 
query of interest. Our approach suggests that constraints 
and constrained inference can be an effective path to dis- 
covering new, more accurate noise distributions that satisfy 
differential privacy. As a practical matter, our approach 
does not necessarily burden the analyst with the constrained 
inference process because the server can implement the post- 
processing step. In that case it would appear to the analyst 
as if the server was sampling from the improved distribution. 

While our focus has been on histogram queries, the tech- 
niques are probably not limited to histograms and could 
have broader impact. However, a general formulation may 
be challenging to develop. There is a subtle relationship be- 
tween constraints and sensitivity: reformulating a query so 
that it becomes highly constrained may similarly increase its 
sensitivity. A challenge is finding queries such as S and H 
that have useful constraints but remain low sensitivity. An- 
other challenge is the computational efficiency of constrained 
inference, which is posed here as a constrained optimization 
problem with a quadratic objective function. The complex- 
ity of solving this problem will depend on the nature of the 
constraints and is NP-Hard in general. Our analysis shows 
that the constraint sets of S and H admit closed-form solu- 
tions that are efficient to compute. 
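As an illustration of such a closed-form solution, the recurrence of Theorem 3 (Appendix D.3) for the hierarchical query can be sketched as two passes over the tree. The top-down pass follows the theorem as restated there. The bottom-up weights that define z are recalled from Sec. 4.2, which lies outside this appendix, so they should be treated as an assumption and checked against that section. The level-order array layout and the function name are also hypothetical.

```python
def hierarchical_inference(noisy, k=2):
    """Two-pass least-squares fit for a complete k-ary tree of noisy
    counts stored in level order (index 0 is the root; the children of
    node v are k*v+1, ..., k*v+k)."""
    m = len(noisy)

    def children(v):
        return range(k * v + 1, k * v + k + 1)

    # Bottom-up pass: z[v] blends the node's own noisy count with the
    # sum of its children's z values (weights assumed from Sec. 4.2).
    z = [0.0] * m
    height = [1] * m                      # leaves have height 1
    for v in range(m - 1, -1, -1):
        if k * v + 1 >= m:                # v is a leaf
            z[v] = noisy[v]
        else:
            height[v] = height[k * v + 1] + 1
            l = height[v]
            w = (k ** l - k ** (l - 1)) / (k ** l - 1)
            z[v] = w * noisy[v] + (1 - w) * sum(z[u] for u in children(v))

    # Top-down pass: split each parent's residual equally among children.
    h = [0.0] * m
    h[0] = z[0]
    for v in range(1, m):
        u = (v - 1) // k                  # parent of v
        h[v] = z[v] + (h[u] - sum(z[w] for w in children(u))) / k
    return h
```

For example, with a root and two leaves holding noisy counts 10, 3, and 4, the sketch returns approximately 9, 4, and 5: the leaves are raised and the root lowered until they agree. Both passes touch each node a constant number of times, so the cost is linear in the size of the tree.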

C. ADDITIONAL EXPERIMENTS 

This section provides detailed descriptions of the datasets, 
and additional results for unattributed histograms. 

NetTrace is derived from an IP-level network trace col- 
lected at a major university. The trace monitors traffic at 
the gateway between internal IP addresses and external IP 
addresses. From this data, we derived a bipartite connec- 
tion graph where the nodes are hosts, labeled by their IP 
address, and an edge connotes the transmission of at least 
one data packet. Here, differential privacy ensures that in- 
dividual connections remain private. 

Social Network is a graph derived from friendship rela- 
tions on an online social network site. The graph is limited 
to a population of roughly 11,000 students from a single 
university. Differential privacy implies that friendships will 
not be disclosed. The size of the graph (number of students) 
is assumed to be public knowledge. (Footnote: This is not a
critical assumption and, in fact, the number of students can be
estimated privately within $\pm 1/\epsilon$ in expectation. Our techniques
can be applied directly to either the true count or a noisy estimate.)

Search Logs is a dataset of search query logs over time
from Jan. 1, 2004 to the present. For privacy reasons, it is
difficult to obtain such data. Our dataset is derived from a
search engine interface that publishes summary statistics for
specified query terms. We combined these summary statistics
with a second dataset, which contains actual search
query logs but for a much shorter time period, to produce a
synthetic data set. In the experiments, ground truth refers
to this synthetic dataset. Differential privacy guarantees
that the output will prevent the association of an individual
entity (user, host) with a particular search term.

Unattributed histograms. Figure 7 provides some intuition
for how inference is able to reduce error. Shown is a
portion of the unattributed histogram of NetTrace: the sequence
is sorted in descending order along the x-axis and the
y-axis indicates the count. The solid gray line corresponds
to ground truth: a long horizontal stretch indicates a subsequence
of uniform counts and a vertical drop indicates a
decrease in count. The graphic shows only the middle portion
of the unattributed histogram; some very large and
very small counts are omitted to improve legibility. The
solid black lines indicate the error of $\bar{S}$ averaged over 200
random samples of $\tilde{S}$ (with $\epsilon = 1.0$); the dotted gray lines
indicate the expected error of $\tilde{S}$.

Figure 7: On NetTrace, $S(I)$ (solid gray), the average error of $\bar{S}$ (solid black) and $\tilde{S}$ (dotted gray), for $\epsilon = 1.0$.

The inset graph on the left reveals larger error at the beginning
of the sequence, when each count occurs once or
only a few times. However, as the counts become more concentrated
(longer subsequences of uniform count), the error
diminishes, as shown in the right inset. Some error remains
around the points in the sequence where the count changes,
but the error is reduced to zero for positions in the middle
of uniform subsequences.

Figure 7 illustrates that our approach reduces or eliminates
noise in precisely the parts of the sequence where the
noise is unnecessary for privacy. Changing a tuple in the
database cannot change a count in the middle of a uniform
subsequence, only at the end points. These experimental
results also align with Theorem 2, which states that the error
of $\bar{S}$ is a function of the number of distinct counts in
the sequence. In fact, the experimental results suggest that
the theorem also holds locally for subsequences with a small
number of distinct counts. This is an important result since
the typical degree sequences that arise in real data, such
as the power-law distribution, contain very large uniform
subsequences.

D. PROOFS 

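Proof of Proposition 2. For any output $\bar{q}$, let
$\mathcal{S}(\bar{q})$ denote the set of noisy answers such that if $\tilde{q} \in \mathcal{S}(\bar{q})$
then the minimum $L_2$ solution given $\tilde{q}$ and $\gamma_Q$ is $\bar{q}$. For
any $I$ and $I' \in nbrs(I)$, the following shows that $\bar{Q}$ is
$\epsilon$-differentially private:

$$\Pr[\bar{Q}(I) = \bar{q}] = \Pr[\tilde{Q}(I) \in \mathcal{S}(\bar{q})]$$
$$\le \exp(\epsilon) \Pr[\tilde{Q}(I') \in \mathcal{S}(\bar{q})]$$
$$= \exp(\epsilon) \Pr[\bar{Q}(I') = \bar{q}]$$

where the inequality is because $\tilde{Q}$ is $\epsilon$-differentially private. □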

Proof of Proposition 3. Given a database $I$, suppose
we add a record to it to obtain $I'$. The added record affects
one count in $L$, i.e., there is exactly one $i$ such that $L(I)[i] = x$
and $L(I')[i] = x+1$, and all other counts are the same. The
added record affects $S$ as follows. Let $j$ be the largest index
such that $S(I)[j] = x$, then the added record increases the
count at $j$ by one: $S(I')[j] = x + 1$. Notice that this change
does not affect the sort order, i.e., in $S(I')$, the $j$-th value
remains in sorted order: $S(I')[j - 1] \le x$, $S(I')[j] = x + 1$,
and $S(I')[j + 1] \ge x + 1$. All other counts in $S$ are the same,
and thus the $L_1$ distance between $S(I)$ and $S(I')$ is 1. □
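The argument can be checked numerically on a small example. The sketch below is only a sanity check of the proposition, with hypothetical helper names; it adds one record to each unit count in turn and measures how far the sorted sequence moves.

```python
def sorted_query(unit_counts):
    """S: the unit-length counts in ascending order."""
    return sorted(unit_counts)


def max_l1_change_after_insert(unit_counts):
    """Largest L1 distance between S(I) and S(I') over all ways of
    adding a single record to I."""
    base = sorted_query(unit_counts)
    worst = 0
    for i in range(len(unit_counts)):
        neighbor = list(unit_counts)
        neighbor[i] += 1              # one new record lands in bucket i
        moved = sorted_query(neighbor)
        worst = max(worst, sum(abs(a - b) for a, b in zip(base, moved)))
    return worst
```

Even when the incremented count has duplicates, so that its position within the sorted order is ambiguous, the sorted sequences differ in exactly one position by exactly one.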

Proof of Proposition 4. If a tuple is added or removed
from the relation, this affects the count for every range
that includes it. There are exactly $\ell$ ranges that include
a given tuple: the range of a single leaf and the ranges of
the nodes along the path from that leaf to the root. Therefore,
adding/removing a tuple changes exactly $\ell$ counts each
by exactly 1. Thus, the sensitivity is equal to $\ell$, the height
of the tree. □
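A similar numerical check applies to the hierarchical query. The sketch assumes a binary tree over a power-of-two number of leaves, stored in heap order; the helper names are illustrative.

```python
def hierarchical_query(unit_counts):
    """H for a binary tree: all interval counts from the root down to
    the leaves (len(unit_counts) must be a power of two)."""
    n = len(unit_counts)
    tree = [0] * (2 * n)
    tree[n:] = unit_counts
    for v in range(n - 1, 0, -1):
        tree[v] = tree[2 * v] + tree[2 * v + 1]
    return tree[1:]


def l1_change_when_adding_record(unit_counts, i):
    """L1 distance between H(I) and H(I') when one record is added to
    unit bucket i."""
    neighbor = list(unit_counts)
    neighbor[i] += 1
    before = hierarchical_query(unit_counts)
    after = hierarchical_query(neighbor)
    return sum(abs(a - b) for a, b in zip(before, after))
```

With 8 leaves the tree has 4 levels, and every single-record change moves exactly 4 counts by 1, one per node on the leaf-to-root path.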

D.1 Proof of Theorem 1

We first restate the theorem below. Recall that $\tilde{s}[i,j]$
denotes the subsequence of $j - i + 1$ elements:
$\langle \tilde{s}[i], \tilde{s}[i+1], \ldots, \tilde{s}[j] \rangle$. Let $M[i,j]$ record the mean of these
elements, i.e. $M[i,j] = \sum_{k=i}^{j} \tilde{s}[k] / (j - i + 1)$.

Theorem 1. Denote $L_k = \min_{j \in [k,n]} \max_{i \in [1,j]} M[i,j]$ and
$U_k = \max_{i \in [1,k]} \min_{j \in [i,n]} M[i,j]$. The minimum $L_2$ solution
$\bar{s}$ is unique and given by: $\bar{s}[k] = L_k = U_k$.
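The min-max formula can be evaluated directly on short sequences and compared with an independent least-squares fit. The sketch below uses a brute-force evaluation of $L_k$ and $U_k$ (with 0-based indices) and the standard pool-adjacent-violators procedure as the reference; that procedure is a well-known algorithm for this problem and is not claimed to be the authors' implementation.

```python
def mean(seq, i, j):
    """M[i, j]: mean of seq[i..j], 0-based and inclusive."""
    return sum(seq[i:j + 1]) / (j - i + 1)


def minmax_solution(noisy):
    """Brute-force L_k and U_k from Theorem 1 (cubic time or worse;
    intended only for small inputs)."""
    n = len(noisy)
    lower = [min(max(mean(noisy, i, j) for i in range(j + 1))
                 for j in range(k, n)) for k in range(n)]
    upper = [max(min(mean(noisy, i, j) for j in range(i, n))
                 for i in range(k + 1)) for k in range(n)]
    return lower, upper


def pava(noisy):
    """Pool-adjacent-violators: least-squares non-decreasing fit."""
    blocks = []                       # each block is [sum, count]
    for value in noisy:
        blocks.append([value, 1])
        # Merge while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit
```

For the noisy sequence 9, 14, 10, both routes give 9, 12, 12: the out-of-order pair is replaced by its mean, while the first value is left unchanged.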

Proof. In the proof, we abbreviate the notation and implicitly
assume that the range of $i$ is $[1,n]$ or $[1,j]$ when $j$ is
specified. Similarly, the range of $j$ is $[1,n]$ or $[i,n]$ when $i$ is
specified.

We start with the easy part, showing that $U_k \le L_k$. Define
an $n \times n$ matrix $A^k$ as follows:

$$A^k_{ij} = \begin{cases} M[i,j] & \text{if } i \le j \\ \infty & \text{if } j < i \le k \\ -\infty & \text{otherwise} \end{cases}$$

Then $\min_j \max_i A^k_{ij} = L_k$ and $\max_i \min_j A^k_{ij} = U_k$. In any
matrix $A$, $\max_i \min_j A_{ij} \le \min_j \max_i A_{ij}$: this is a simple
fact that can be checked directly, or see [19], hence $U_k \le L_k$.
We show next that if $\bar{s}$ is the minimum $L_2$ solution, then
$L_k \le \bar{s}[k] \le U_k$. If we show this, then the proof of the
theorem is completed, as we will then have $\bar{s}[k] = L_k = U_k$.
The proof relies on the following lemma.

Lemma 1. Let $\bar{s}$ be the minimum $L_2$ solution. Then (i)
$\bar{s}[1] \le U_1$, (ii) $\bar{s}[n] \ge L_n$, (iii) for all $k$,
$\min(\bar{s}[k+1], \max_i M[i,k]) \le \bar{s}[k] \le \max(\bar{s}[k-1], \min_j M[k,j])$.



The proof of the lemma appears below, but now we use
it to complete the proof of Theorem 1. First, we show that
$\bar{s}[k] \le U_k$ using induction on $k$. The base case is $k = 1$ and
it is stated in the lemma, part (i). For the inductive step,
assume $\bar{s}[k-1] \le U_{k-1}$. From (iii), we have that

$$\bar{s}[k] \le \max(\bar{s}[k-1], \min_j M[k,j]) \le \max(U_{k-1}, \min_j M[k,j]) = U_k$$

The last step follows from the definition of $U_k$. A similar
induction argument shows that $\bar{s}[k] \ge L_k$, except the order
is reversed: the base case is $k = n$ and the inductive step
assumes $\bar{s}[k+1] \ge L_{k+1}$. □

The only remaining step is to prove the lemma.

Proof of Lemma 1. For (i), it is sufficient to prove that
$\bar{s}[1] \le M[1,j]$ for all $j \in [1,n]$. Assume the contrary. Thus
there exists a $j$ such that $\bar{s}[1] > M[1,j]$. Let $\delta = \bar{s}[1] - M[1,j]$.
Thus $\delta > 0$. Further, for all $i$, denote $\delta_i = \bar{s}[i] - \bar{s}[1]$.
Consider the sequence $\bar{s}'$ defined as follows:

$$\bar{s}'[i] = \begin{cases} \bar{s}[i] - \delta & \text{if } i \le j \\ \bar{s}[i] & \text{otherwise} \end{cases}$$

It is easy to see that since $\bar{s}$ is a sorted sequence, so is $\bar{s}'$.

We now claim that $\|\bar{s}' - \tilde{s}\|_2 < \|\bar{s} - \tilde{s}\|_2$. For this note that
since the sequence $\bar{s}'[j+1,n]$ is identical to the sequence
$\bar{s}[j+1,n]$, it is sufficient to prove
$\|\bar{s}'[1,j] - \tilde{s}[1,j]\|_2 < \|\bar{s}[1,j] - \tilde{s}[1,j]\|_2$.
To prove that, note that $\|\bar{s}[1,j] - \tilde{s}[1,j]\|_2^2$ can be
expanded as

$$\|\bar{s}[1,j] - \tilde{s}[1,j]\|_2^2 = \sum_{i=1}^{j} (\bar{s}[i] - \tilde{s}[i])^2 = \sum_{i=1}^{j} (\bar{s}[1] + \delta_i - \tilde{s}[i])^2$$
$$= \sum_{i=1}^{j} (M[1,j] + \delta + \delta_i - \tilde{s}[i])^2$$

Suppose for a moment that we fix $M[1,j]$ and the $\delta_i$'s, and treat
$\|\bar{s}[1,j] - \tilde{s}[1,j]\|_2^2$ as a function $f$ of $\delta$. The derivative of
$f(\delta)$ is:

$$f'(\delta) = 2\sum_{i=1}^{j} (M[1,j] + \delta + \delta_i - \tilde{s}[i])$$
$$= 2\Big(jM[1,j] - \sum_{i=1}^{j} \tilde{s}[i]\Big) + 2j\delta + 2\sum_{i=1}^{j} \delta_i = 2j\delta + 2\sum_{i=1}^{j} \delta_i$$

where the last step uses $jM[1,j] = \sum_{i=1}^{j} \tilde{s}[i]$.
Since $\delta_i \ge 0$ for all $i$, the derivative is strictly greater
than zero for any $\delta > 0$, which implies that $f$ is a strictly
increasing function of $\delta$ and has a minimum at $\delta = 0$. Therefore,
$\|\bar{s}[1,j] - \tilde{s}[1,j]\|_2^2 = f(\delta) > f(0) = \|\bar{s}'[1,j] - \tilde{s}[1,j]\|_2^2$.
This is a contradiction since it was assumed that $\bar{s}$ was the
minimum solution. This completes the proof for (i).

For (ii), the proof of $\bar{s}[n] \ge \max_i M[i,n]$ follows from
a similar argument: if $\bar{s}[n] < M[i,n]$ for some $i$, define
$\delta = M[i,n] - \bar{s}[n]$ and the sequence $\bar{s}'$ with elements
$\bar{s}'[j] = \bar{s}[j] + \delta$ for $j \ge i$. Then $\bar{s}'$ can be shown to be a strictly
better solution than $\bar{s}$, proving (ii).



For the proof of (iii), we first show that
$\bar{s}[k] \le \max(\bar{s}[k-1], \min_j M[k,j])$. Assume the contrary, i.e. there exists a
$k$ such that $\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > \min_j M[k,j]$. In
other words, we assume there exists a $k$ and $j$ such that
$\bar{s}[k] > \bar{s}[k-1]$ and $\bar{s}[k] > M[k,j]$. Denote
$\delta = \bar{s}[k] - \max(\bar{s}[k-1], M[k,j])$. By our assumption above, $\delta > 0$.
Define the sequence

$$\bar{s}'[i] = \begin{cases} \bar{s}[i] - \delta & \text{if } k \le i \le j \\ \bar{s}[i] & \text{otherwise} \end{cases}$$

Note that by construction,
$\bar{s}'[k] = \bar{s}[k] - \delta = \bar{s}[k] - (\bar{s}[k] - \max(\bar{s}[k-1], M[k,j])) = \max(\bar{s}[k-1], M[k,j])$.
It is easy to see that $\bar{s}'$ is sorted (indeed the only inversion in the sort
order could have occurred if $\bar{s}'[k-1] > \bar{s}'[k]$, but it does not, as
$\bar{s}'[k-1] = \bar{s}[k-1] \le \max(\bar{s}[k-1], M[k,j]) = \bar{s}'[k]$).

Now a similar argument as in the proof of (i), applied to the
sequence $\bar{s}[k,j]$, yields that the error
$\|\bar{s}'[k,j] - \tilde{s}[k,j]\|_2 < \|\bar{s}[k,j] - \tilde{s}[k,j]\|_2$.
Thus $\|\bar{s}' - \tilde{s}\|_2 < \|\bar{s} - \tilde{s}\|_2$ and $\bar{s}'$
is a strictly better solution than $\bar{s}$. This yields a contradiction
as $\bar{s}$ is the minimum $L_2$ solution. Hence
$\bar{s}[k] \le \max(\bar{s}[k-1], \min_j M[k,j])$.

A similar argument in the reverse direction shows
that $\bar{s}[k] \ge \min(\bar{s}[k+1], \max_i M[i,k])$, completing the proof of
(iii). □
D.2 Proof of Theorem 2

We first restate the theorem below. Denote $n$ and $d$ as
the number of values and the number of distinct values in
$S(I)$ respectively. Let $n_1, n_2, \ldots, n_d$ be the number of times
each of the $d$ distinct values occur in $S(I)$ (thus $\sum_i n_i = n$).

Theorem 2. There exist constants $c_1$ and $c_2$ independent
of $n$ and $d$ such that

$$error(\bar{S}) \le \sum_{i=1}^{d} \frac{c_1 \log^3 n_i + c_2}{\epsilon^2}$$

Thus $error(\bar{S}) = O(d \log^3 n/\epsilon^2)$ whereas $error(\tilde{S}) = \Theta(n/\epsilon^2)$.

Before showing the proof, we prove the following lemma. 

Lemma 2. Let $s = S(I)$ be the input sequence. Call a
translation of $s$ the operation of subtracting from each element
of $s$ a fixed amount $\delta$. Then $error(\bar{S}[i])$ is invariant
under translation for all $i$.

Proof. Denote $\Pr(\bar{s}|s)$ ($\Pr(\tilde{s}|s)$) the probability that
$\bar{s}$ ($\tilde{s}$) is output on the input sequence $s$. Denote $s'$, $\tilde{s}'$, and
$\bar{s}'$ the sequences obtained by translating $s$, $\tilde{s}$, and $\bar{s}$ by $\delta$,
respectively.

First observe that $\Pr(\tilde{s}|s) = \Pr(\tilde{s}'|s')$ as $\tilde{s}$ and $\tilde{s}'$ are
obtained by adding the same Laplacian noise to $s$ and $s'$,
respectively. Using Theorem 1 (since all $U_k$'s and $L_k$'s shift
by $\delta$ on translating $\tilde{s}$ by $\delta$), we get that if $\bar{s}$ is the minimum
$L_2$ solution given $\tilde{s}$, then $\bar{s}'$ is the minimum $L_2$ solution
given $\tilde{s}'$. Thus, $\Pr(\bar{s}|s) = \Pr(\bar{s}'|s')$ for all sequences $\bar{s}$. Further,
since $\bar{s}[i]$ and $\bar{s}'[i]$ yield the same $L_2$ error with $s[i]$
and $s'[i]$ respectively, we get that the expected $error(\bar{S}[i])$
is the same for both inputs $s$ and $s'$. □

Lemma 3. Let $X$ be any positive random variable that is
bounded ($\lim_{x \to \infty} x \Pr(X > x)$ exists). Then

$$E(X) \le \int_0^\infty \Pr(X > x)\,dx$$

Proof. The proof follows from the following chain of
relations.

$$E(X) = -\int_0^\infty x \frac{d}{dx}\big(\Pr(X > x)\big)\,dx$$
$$= -\big[x \Pr(X > x)\big]_0^\infty + \int_0^\infty \Pr(X > x)\,dx \quad \text{(by parts)}$$
$$= -\lim_{x \to \infty} x \Pr(X > x) + \int_0^\infty \Pr(X > x)\,dx$$
$$\le \int_0^\infty \Pr(X > x)\,dx$$

Here the last inequality follows as $X$ is bounded and therefore
the limit exists and is non-negative. This completes the
proof. □

We next state a theorem that was shown in [6].

Theorem 5 (Theorem 3.4 [6]). Suppose that $X_1, X_2, \ldots, X_n$
are independent random variables satisfying
$X_i \le E(X_i) + M$, for $1 \le i \le n$. We consider the sum
$X = \sum_{i=1}^{n} X_i$ with expectation $E(X) = \sum_{i=1}^{n} E(X_i)$ and
$Var(X) = \sum_{i=1}^{n} Var(X_i)$. Then, we have

$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(Var(X) + M\lambda/3)}}$$

For a random variable $X$, denote $I_X$ the indicator function
that $X > 0$ (thus $I_X = 1$ if $X > 0$ and 0 otherwise). Using
Theorem 5, we prove the following lemma.

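Lemma 4. Suppose $i,j$ are indices such that for all $k \in [i,j]$,
$s[k] \le 0$. Then there exists a constant $c$ such that for
all $\tau > 1$ the following holds.

$$\Pr\left(M[i,j]^2 I_{M[i,j]} \ge \frac{c \log^2((j-i+1)\tau)}{(j-i+1)\epsilon^2}\right) \le \frac{1}{(j-i+1)^2\tau^2}$$

Proof. We apply Theorem 5 on $\tilde{s}[k]$ for $k \in [i,j]$. First
note that $E(\tilde{s}[k]) = s[k] \le 0$. Further $Var(\tilde{s}[k]) = 2/\epsilon^2$ as $\tilde{s}[k]$
is obtained by adding Laplace noise to $s[k]$ which has this
variance. We also know that $\tilde{s}[k] \ge M + s[k]$ happens with
probability at most $e^{-\epsilon M}/2$.

For simplicity, call $n$ to be $j - i + 1$. Denoting
$X = \sum_{k \in [i,j]} \tilde{s}[k]$, we get that $E(X) \le 0$ and $Var(X) = 2n/\epsilon^2$.
Further, set $M = 3\log(n\tau)/\epsilon$. Denote $B$ the event that for
some $k$, $\tilde{s}[k] \ge M + s[k]$. Thus $\Pr(B) \le ne^{-\epsilon M}/2 = \frac{1}{2n^2\tau^3}$.
If $B$ does not happen, we know that $\tilde{s}[k] \le M + s[k]$ for all
$k \in [i,j]$. Thus we can then apply Theorem 5 to get:

$$\Pr(X \ge E(X) + \lambda) \le e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda\log(n\tau)/\epsilon)}} + \Pr(B) \le e^{-\frac{\lambda^2}{2(2n/\epsilon^2 + \lambda\log(n\tau)/\epsilon)}} + \frac{1}{2n^2\tau^3}$$

Setting $\lambda = \frac{8}{\epsilon}\sqrt{n}\log(n\tau)$ gives us that

$$\Pr\left(X \ge E(X) + \frac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$

Since $E(X) \le 0$, we get

$$\Pr\left(X \ge \frac{8}{\epsilon}\sqrt{n}\log(n\tau)\right) \le \frac{1}{n^2\tau^2}$$

Also we observe that $M[i,j] = X/n$, which yields

$$\Pr\left(M[i,j] \ge \frac{8\log(n\tau)}{\epsilon\sqrt{n}}\right) \le \frac{1}{n^2\tau^2}$$

Finally, observe that $M[i,j] < c$ implies that $M[i,j]^2 I_{M[i,j]} < c^2$
for any $c > 0$. Thus we get

$$\Pr\left(M[i,j]^2 I_{M[i,j]} \ge \frac{64\log^2(n\tau)}{n\epsilon^2}\right) \le \frac{1}{n^2\tau^2}$$

Putting $n = j - i + 1$ and using $c = 64$ gives us the required
result. □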

Now we can give the proof of Theorem 2. 

Proof of Theorem 2. The proof of $error(\tilde{S}) = \Theta(n/\epsilon^2)$
is obvious since:

$$error(\tilde{S}) = \sum_{k=1}^{n} error(\tilde{s}[k]) = n \cdot \frac{2}{\epsilon^2}$$

In the rest of the proof, we shall bound $error(\bar{S})$.
Let $s = S(I)$ be the input sequence. We know that $s$ consists
of $d$ distinct elements. Denote $s_r$ as the $r$-th distinct element
of $s$. Also denote $[l_r, u_r]$ as the set of indices corresponding
to $s_r$, i.e. $\forall i \in [l_r, u_r]$, $s[i] = s_r$ and $\forall i \notin [l_r, u_r]$, $s[i] \ne s_r$. Let
$M[i,j]$ record the mean of elements in $\tilde{s}[i,j]$, i.e.
$M[i,j] = \sum_{k=i}^{j} \tilde{s}[k] / (j - i + 1)$.

To bound $error(\bar{S})$, we shall bound $error(\bar{S}[i])$ separately
for each $i$. To bound $error(\bar{S}[i])$, we can assume W.L.O.G.
that $s[i]$ is 0. This is because if $s[i] \ne 0$, then we can translate
the sequence $s$ by $s[i]$. As shown in Lemma 2 this
preserves $error(\bar{S}[i])$, while making $s[i] = 0$.

Let $k \in [l_r, u_r]$ be any index for the $r$-th distinct element of
$s$. By definition, $error(\bar{S}[k]) = E(\bar{s}[k] - s[k])^2 = E(\bar{s}[k]^2)$ (as
we can assume W.L.O.G. $s[k] = 0$). From Theorem 1, we
know that $\bar{s}[k] = U_k$. Thus $error(\bar{S}[k]) = E(U_k^2)$. Here we
treat $U_k = \max_{i \le k} \min_{j \ge i} M[i,j]$ as a random variable. Now
by definition of $E$, we have

$$E(U_k^2) = E(U_k^2 I_{U_k}) + E(U_k^2 (1 - I_{U_k})) = A + B \text{ (say)}$$

We shall bound $A$ and $B$ separately. For bounding $A$,
denote $\mathcal{U}_k = \max_{i \le k} M[i, u_r]$. It is apparent that $\mathcal{U}_k \ge U_k$
and thus $\mathcal{U}_k^2 I_{\mathcal{U}_k} \ge U_k^2 I_{U_k}$. To bound $A$, we observe that

$$A = E(U_k^2 I_{U_k}) \le E(\mathcal{U}_k^2 I_{\mathcal{U}_k})$$

Further, since $\mathcal{U}_k = \max_{i \le k} M[i, u_r]$, we know that
$\mathcal{U}_k^2 I_{\mathcal{U}_k} = \max_{i \le k} M[i,u_r]^2 I_{M[i,u_r]}$. Thus we can write:

$$A \le E\left(\max_{i \le k} M[i,u_r]^2 I_{M[i,u_r]}\right)$$

Let $\tau > 1$ be any number and $c$ be the constant used in
Lemma 4. Let us denote $e_i$ the event that:

$$M[i,u_r]^2 I_{M[i,u_r]} \ge \frac{c\log^2((u_r - i + 1)\tau)}{(u_r - i + 1)\epsilon^2}$$

We can apply Lemma 4 to compute the probability of $e_i$ as
$s[j] \le 0$ for all $j \le u_r$ (as we assumed W.L.O.G. $s[k] = 0$).
Thus we get $\Pr(e_i) \le \frac{1}{(u_r - i + 1)^2\tau^2}$.

Define $e = \bigvee_{i=1}^{k} e_i$. Then
$\Pr(e) \le \sum_{i=1}^{k} \Pr(e_i) \le \frac{1}{\tau^2}\sum_{m \ge 1} \frac{1}{m^2} = \frac{\pi^2}{6\tau^2}$.
If the event $e$ does not happen, then it is
easy to see that

$$\mathcal{U}_k^2 I_{\mathcal{U}_k} = \max_{i \le k} M[i,u_r]^2 I_{M[i,u_r]} \le \max_{i \le k} \frac{c\log^2((u_r - i + 1)\tau)}{(u_r - i + 1)\epsilon^2}$$



To obtain a bound on the total error (S). 

d 

error (S) = error (S [A:]) 

r=l kE[lr,Ur] 

Cl log^ (iXr - A: + 1) + C2 

{ur-k + l)e2 

Cl log^ (/C - + 1) + C2 

(A; - + l)e2 



< 



E E 



^ E 



Cl log (Wr - Zr + 1) + C2 



(^1^ - A: + l)e2 



Finally noting that i^r — + 1 is just rir, the number of 
occurrences of Sr in s, we get error{S) = ci iog^nr+c2 _ 
0(dlog^ n/e^). This completes the proof of the theorem. □ 

D.3 Proof of Theorem 3 

We first restate the theorem below. 

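Theorem 3. Given the noisy sequence $\tilde{h} = \tilde{H}(I)$, the
unique minimum $L_2$ solution, $\bar{h}$, is given by the following
recurrence relation. Let $u$ be $v$'s parent:

$$\bar{h}[v] = \begin{cases} z[v], & \text{if } v \text{ is the root} \\ z[v] + \frac{1}{k}\left(\bar{h}[u] - \sum_{w \in succ(u)} z[w]\right), & \text{o.w.} \end{cases}$$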

Proof. We first show that $\bar{h}[r] = z[r]$ for the root node
$r$. By definition of a minimum $L_2$ solution, the sequence
$\bar{h}$ satisfies the following constrained optimization problem.
Let $succZ[u] = \sum_{w \in succ(u)} z[w]$.

$$\text{minimize } \sum_{v} (\bar{h}[v] - \tilde{h}[v])^2$$

$$\text{subject to } \forall v, \ \sum_{u \in succ(v)} \bar{h}[u] = \bar{h}[v]$$

Denote $leaves(v)$ to be the set of leaf nodes in the subtree
rooted at $v$. The above optimization problem can be rewritten
as the following unconstrained minimization problem.

$$\text{minimize } \sum_{v} \left( \Big( \sum_{l \in leaves(v)} \bar{h}[l] \Big) - \tilde{h}[v] \right)^2$$

For finding the minimum, we take the derivative w.r.t. $\bar{h}[l]$ for
each $l$ and equate it to 0. We thus get the following set of
equations for the minimum solution.



Thus with at least probability 1 — (which is Pr(-ie)), 
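we get $\mathcal{U}_k^2 I_{\mathcal{U}_k}$ is bounded as above. This yields that there exist
constants $c_1$ and $c_2$ such that $E(\mathcal{U}_k^2 I_{\mathcal{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\epsilon^2}$.
The proof is by the application of Lemma 3 (as $\mathcal{U}_k$ is bounded)
and a simple integration over $\tau$ ranging from 1 to $\infty$. Finally
we get that $A \le E(\mathcal{U}_k^2 I_{\mathcal{U}_k}) \le \frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\epsilon^2}$.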

Recall that $B = E(U_k^2 (1 - I_{U_k}))$. We can write $B$ as
$E(L_k^2 (1 - I_{L_k}))$ as $L_k = U_k$. Using the exact same arguments
as above for $L_k$ but on sequence $-\tilde{S}$ yields that

$$B \le \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\epsilon^2}$$

Finally, we get that error($\bar{S}[k]$) $= A + B$ which is less than

$$\frac{c_1 \log^2(u_r - k + 1) + c_2}{(u_r - k + 1)\epsilon^2} + \frac{c_1 \log^2(k - l_r + 1) + c_2}{(k - l_r + 1)\epsilon^2}$$



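$$\forall l, \quad \sum_{v : l \in leaves(v)} 2\left( \Big( \sum_{l' \in leaves(v)} \bar{h}[l'] \Big) - \tilde{h}[v] \right) = 0$$

Since $\sum_{l' \in leaves(v)} \bar{h}[l'] = \bar{h}[v]$, the above set of equations
can be rewritten as: $\forall l$, $\sum_{v : l \in leaves(v)} \bar{h}[v] = \sum_{v : l \in leaves(v)} \tilde{h}[v]$.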

For a leaf node $l$, we can think of the above equation for $l$
as corresponding to a path from $l$ to the root $r$ of the tree.
The equation states that the sums of the sequences $\bar{h}$ and $\tilde{h}$ over
the nodes along the path are the same. We can sum all the
equations to obtain the following equation.

$$\sum_{v} \sum_{l \in leaves(v)} \bar{h}[v] = \sum_{v} \sum_{l \in leaves(v)} \tilde{h}[v]$$
Denote $level(i)$ as the set of nodes at height $i$ of the tree.
Thus root belongs to $level(\ell - 1)$ and leaves in $level(0)$.
Abbreviating LHS (RHS) for the left (right) hand side of
the above equation, we observe the following.

$$
\begin{aligned}
LHS &= \sum_{v} \sum_{l \in leaves(v)} \bar{h}[v] \\
&= \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \bar{h}[v] \\
&= \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i \bar{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \bar{h}[v] \\
&= \sum_{i=0}^{\ell-1} k^i \bar{h}[r]
\end{aligned}
$$

Here we use the fact that $\sum_{v \in level(i)} \bar{h}[v] = \bar{h}[r]$ for any
level $i$. This is because $\bar{h}$ satisfies the constraints of the tree.
In a similar way, we also simplify the RHS.

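$$
\begin{aligned}
RHS &= \sum_{v} \sum_{l \in leaves(v)} \tilde{h}[v] \\
&= \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} \sum_{l \in leaves(v)} \tilde{h}[v] \\
&= \sum_{i=0}^{\ell-1} \sum_{v \in level(i)} k^i \tilde{h}[v] = \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]
\end{aligned}
$$

Note that we cannot simplify the RHS further as $\tilde{h}[v]$ may
not satisfy the constraints of the tree. Finally, equating LHS
and RHS we get the following equation.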



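$$\bar{h}[r] = \frac{k - 1}{k^{\ell} - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$

Further, it is easy to expand $z[r]$ and check that

$$z[r] = \frac{k - 1}{k^{\ell} - 1} \sum_{i=0}^{\ell-1} k^i \sum_{v \in level(i)} \tilde{h}[v]$$

Thus we get $\bar{h}[r] = z[r]$. For nodes other than $r$,
assume that we have computed $\bar{h}[u]$ for $u = pred(v)$. Denote
$H = \bar{h}[u]$. Once $H$ is fixed, we can argue that the value of
$\bar{h}[v]$ will be independent of the values of $\tilde{h}[w]$ for any $w$ not
in the subtree of $u$.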

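For nodes $w \in subtree(u)$ the $L_2$ minimization problem is
equivalent to the following one.

$$\text{minimize } \sum_{w \in subtree(u)} (\bar{h}[w] - \tilde{h}[w])^2$$

$$\text{subject to } \forall w \in subtree(u), \ \sum_{w' \in succ(w)} \bar{h}[w'] = \bar{h}[w] \quad \text{and} \quad \sum_{v \in succ(u)} \bar{h}[v] = H$$

Again using nodes $l \in leaves(u)$, we convert this minimization
into the following one.

$$\text{minimize } \sum_{w \in subtree(u)} \left( \Big( \sum_{l \in leaves(w)} \bar{h}[l] \Big) - \tilde{h}[w] \right)^2$$

$$\text{subject to } \sum_{l \in leaves(u)} \bar{h}[l] = H$$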

We can now use the method of Lagrange multipliers to
find the solution of the above constrained minimization problem.
Using $\lambda$ as the Lagrange parameter for the constraint
$\sum_{l \in leaves(u)} \bar{h}[l] = H$, we get the following sets of equations.

$$\forall l \in leaves(u), \quad \sum_{w : l \in leaves(w)} 2(\bar{h}[w] - \tilde{h}[w]) = -\lambda$$

Adding the equations for all I G leaves{u) and solving 
for A we get A = Here n{u) is the number of 

nodes in suhtree{u). Finally adding the above equations for 
only leaf nodes / G leaves{v), we get 



h[v] = z[v] - {n{v) - 1) • A 



= zv] + 



n{v) 



'-{H - succZ[u\]) 
- succZ[u\) 



n(u) — 

This completes the proof. □ 
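
The recurrence of Theorem 3 can be sketched in Python as a bottom-up pass that computes z, followed by a top-down pass that computes the consistent estimates. This is a minimal sketch under stated assumptions rather than the authors' code: the tree is a complete k-ary tree stored in level order (a hypothetical layout), leaves are taken to have height 1, and the bottom-up weights (k^l - k^(l-1))/(k^l - 1) and (k^(l-1) - 1)/(k^l - 1) for a node of height l are assumed, since this appendix does not restate the definition of z. The function name `consistent_counts` is illustrative.

```python
def consistent_counts(noisy, k, height):
    """Consistent estimates for a complete k-ary tree stored in level order.

    noisy[0] is the root; the children of node v are k*v+1 .. k*v+k.
    Assumption: z[v] for an internal node of height l (leaves have height 1)
    is the weighted average described in the lead-in.
    """
    n_nodes = (k ** height - 1) // (k - 1)
    if len(noisy) != n_nodes:
        raise ValueError("expected %d noisy counts" % n_nodes)
    first_leaf = (k ** (height - 1) - 1) // (k - 1)

    # depth[v] is the distance from the root; node height is height - depth[v].
    depth = [0] * n_nodes
    for v in range(1, n_nodes):
        depth[v] = depth[(v - 1) // k] + 1

    # Bottom-up pass: combine a node's own noisy count with its children's z.
    z = list(noisy)
    succ_z = [0.0] * n_nodes
    for v in reversed(range(first_leaf)):
        l = height - depth[v]
        succ_z[v] = sum(z[c] for c in range(k * v + 1, k * v + k + 1))
        weight = (k ** l - k ** (l - 1)) / (k ** l - 1)
        z[v] = weight * noisy[v] + (1 - weight) * succ_z[v]

    # Top-down pass: h_bar[root] = z[root]; each child receives an equal
    # share of the gap between its parent's estimate and the children's z sum.
    h_bar = list(z)
    for u in range(first_leaf):
        correction = (h_bar[u] - succ_z[u]) / k
        for v in range(k * u + 1, k * u + k + 1):
            h_bar[v] = z[v] + correction
    return h_bar


# Hypothetical noisy counts for a binary tree of height 3 (7 nodes).
example = [10.3, 4.1, 5.2, 1.9, 2.6, 3.4, 2.2]
estimates = consistent_counts(example, k=2, height=3)
```

Two properties from the proof can be checked on the output: every parent estimate equals the sum of its children's estimates, and along each leaf-to-root path the estimates and the noisy counts have the same sum.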

D.4 Proof of Theorem 4 

First, the theorem is restated. 



Theorem 4. (i) $\bar{H}$ is a linear unbiased estimator, (ii)
error($\bar{H}_q$) $\le$ error($E_q$) for all $q$ and for all linear unbiased
estimators $E$, (iii) error($\bar{H}_q$) $= O(\ell^3/\epsilon^2)$ for all $q$, and (iv)
there exists a query $q$ s.t.

$$error(\bar{H}_q) \le \frac{3\, error(\tilde{H}_q)}{2(\ell - 1)(k - 1) - k}$$



Proof. For (i), the linearity of $\bar{H}$ is obvious from the
definition of $z$ and $\bar{h}$. To show $\bar{H}$ is unbiased, we first show
that $z$ is unbiased, i.e. $E(z[v]) = h[v]$. We use induction:
the base case is if $v$ is a leaf node, in which case $E(z[v]) =
E(\tilde{h}[v]) = h[v]$. If $v$ is not a leaf node, assume that we have
shown $z$ is unbiased for all nodes $u \in succ(v)$. Thus

$$E(succZ[v]) = \sum_{u \in succ(v)} E(z[u]) = \sum_{u \in succ(v)} h[u] = h[v]$$

Thus $succZ[v]$ is an unbiased estimator for $h[v]$. Since $z[v]$
is a linear combination of $\tilde{h}[v]$ and $succZ[v]$, which are both
unbiased estimators, $z[v]$ is also unbiased. This completes
the induction step proving that $z$ is unbiased for all nodes.
Finally, we note that $\bar{h}[v]$ is a linear combination of $\tilde{h}[v]$, $z[v]$,
and $succZ[v]$, all of which are unbiased estimators. Thus
$\bar{h}[v]$ is also unbiased, proving (i).

For (ii), we shall use the Gauss-Markov theorem [21]. We
shall treat the sequence $\tilde{h}$ as the set of observed variables,
and $l$, the sequence of original leaf counts, as the set of
unobserved variables. It is easy to see that for all nodes $v$

$$\tilde{h}[v] = \sum_{u \in leaves(v)} l[u] + noise(v)$$

Here $noise(v)$ is the Laplacian random variable, which is
independent for different nodes $v$, but has the same variance
for all nodes. Hence $\tilde{h}$ satisfies the hypothesis of the
Gauss-Markov theorem. (i) shows that $\bar{h}$ is a linear unbiased
estimator. Further, $\bar{h}$ has been obtained by minimizing
the $L_2$ distance with $\tilde{h}[v]$. Hence, $\bar{h}$ is the Ordinary Least
Squares (OLS) estimator, which by the Gauss-Markov theorem
has the least error. Since it is the OLS estimator, it
minimizes the error for estimating any linear combination
of the original counts, which includes in particular the given
range query $q$.

For (iii), we note that any query $q$ can be answered by
summing at most $k\ell$ nodes in the tree. Since for any node
$v$, error($\bar{H}[v]$) $\le$ error($\tilde{H}[v]$) $= 2\ell^2/\epsilon^2$, we get

$$error(\bar{H}_q) \le k\ell\,(2\ell^2/\epsilon^2) = O(\ell^3/\epsilon^2)$$

For (iv), denote $l_1$ and $l_2$ to be the leftmost and rightmost
leaf nodes in the tree. Denote $r$ to be the root. We consider
the query $q$ that asks for the sum of all leaf nodes except for
$l_1$ and $l_2$. Then from (ii), error($\bar{H}_q$) is at most the error of
the estimate $\tilde{h}[r] - \tilde{h}[l_1] - \tilde{h}[l_2]$, which is $6\ell^2/\epsilon^2$. But, on the
other hand, $\tilde{H}$ will require summing $2(k - 1)(\ell - 1) - k$ noisy
counts in total: $2(k - 1)$ at each level of the tree, except that
no count is taken at the root and only $k - 2$ counts are summed
at the level just below the root. Thus error($\tilde{H}_q$) $= 2(2(k - 1)(\ell - 1) - k)\ell^2/\epsilon^2$.
Thus

$$error(\bar{H}_q) \le \frac{3\, error(\tilde{H}_q)}{2(\ell - 1)(k - 1) - k}$$
This completes the proof. □ 
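
The node count used in part (iv) can be checked with a short Python sketch. It assumes a complete k-ary tree of height ℓ over k^(ℓ-1) leaves and the usual top-down decomposition of a range into maximal fully covered subtrees; the function names and the 0-based leaf indexing are hypothetical choices made for illustration.

```python
def count_subtrees(lo, hi, node_lo, node_hi, k):
    """Number of tree nodes summed to answer the leaf range [lo, hi].

    [node_lo, node_hi] is the interval of leaves covered by the current node.
    """
    if hi < node_lo or node_hi < lo:
        return 0  # the node's interval is disjoint from the query
    if lo <= node_lo and node_hi <= hi:
        return 1  # fully covered: this node's count is used directly
    # Partially covered: recurse into the k equal-width children.
    width = (node_hi - node_lo + 1) // k
    return sum(
        count_subtrees(lo, hi, node_lo + j * width,
                       node_lo + (j + 1) * width - 1, k)
        for j in range(k)
    )


def worst_case_count(k, height):
    """Nodes needed for the query that omits the leftmost and rightmost leaf."""
    n_leaves = k ** (height - 1)
    return count_subtrees(1, n_leaves - 2, 0, n_leaves - 1, k)


# Compare the computed count with the closed form 2(k-1)(l-1) - k.
pairs = [(k, l, worst_case_count(k, l), 2 * (k - 1) * (l - 1) - k)
         for k in (2, 3, 4) for l in (3, 4, 5)]
```

For every (k, ℓ) pair in this small sweep the computed count agrees with the closed form, and multiplying it by the per-node error 2ℓ²/ε² gives the error of the noisy-count answer that appears in the ratio above.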



Both techniques scale at most poly-logarithmically with the
size of the domain. However, $\tilde{H}$ scales better with $\epsilon$,
achieving the same utility guarantee with a database that is
smaller by a factor of $O(1/\epsilon^2)$.

The above comparison reveals a distinction between the
techniques: for $\tilde{H}_q$ the bound on absolute error is independent
of database size, i.e., it only depends on $\epsilon$, $\alpha$, and the
size of range. However, for the Blum et al. approach, the
absolute error increases with the size of the database at a
rate of $O(N^{2/3})$.



E. COMPARISON WITH BLUM ET AL. 

We compare a binary $\tilde{H}$ against the binary search equi-depth
histogram of Blum et al. [4] in terms of $(\epsilon, \delta)$-usefulness
as defined by Blum et al. Since $\epsilon$ is used in the usefulness
definition, we use $\alpha$ as the parameter for $\alpha$-differential privacy.

Let $N$ be the number of records in the database. An algorithm
is $(\epsilon, \delta)$-useful for a class of queries if with probability
at least $1 - \delta$, for every query in the class, the absolute error
is at most $\epsilon N$.

For any range query $q$, the absolute error of $\tilde{H}_q$ is $|\tilde{H}_q(I) -
H_q(I)| = |Y|$ where $Y = \sum_{i=1}^{c} \gamma_i$, each $\gamma_i \sim Lap(\ell/\alpha)$, and
$c$ is the number of subtrees in $\tilde{H}_q$, which is at most $2\ell$. We



use Corollary 1 from [5] to bound the error of 
Laplace random variables. With v 
obtain 

mi In ^ 



sum of 



Pr 



\Y\ < 



>l-5' 



The above is for a single range query. To bound the error
for all $\binom{n}{2}$ range queries, we use a union bound. Set $\delta'$ =

Then H is (e, 5)-useful provided that e > [lUi In /a. 

As in Blum et al., we can also fix $\epsilon$ and bound the size of
the database. $\tilde{H}$ is $(\epsilon, \delta)$-useful when



N > 



16^2 In 



2n^ 
S 



= 



ae 



log 2 n (logn-hlog|] 



In comparison, the technique of Blum et al. is $(\epsilon, \delta)$-useful
for range queries when



N>0 



logn(loglogn-hlog^ 